Sound processing apparatus

ABSTRACT

The present invention relates to a voice recognition apparatus capable of easily registering a word which has not been registered. The registering of an unregistered word into a dictionary can be easily performed without causing a significant increase in the size of the dictionary. The clustering unit 29 detects a cluster (detected cluster) to which a new unregistered word is to be added as a new member, from existing clusters obtained by clustering unregistered words. The unregistered word is added as a new member to the detected cluster, and the cluster is divided depending on the members of the cluster such that unregistered words which are acoustically similar to each other belong to the same cluster. The maintenance unit 31 updates the word dictionary on the basis of the result of the clustering. The present invention may be applied to a robot including a voice recognition apparatus.

TECHNICAL FIELD

[0001] The present invention relates to a voice recognition apparatus, and more particularly, to a voice recognition apparatus capable of easily updating a dictionary in which a word or a phrase to be recognized is registered.

BACKGROUND ART

[0002] In conventional voice recognition apparatuses, recognition of a voice uttered by a user is performed by referring to a dictionary in which words to be recognized are registered.

[0003] Therefore, in the voice recognition apparatus, only words which are registered in the dictionary (hereinafter, such words will be referred to simply as registered words) can be recognized, and words which are not registered in the dictionary cannot be recognized. Herein, words which are not registered in the dictionary are referred to as unregistered words. In the conventional voice recognition apparatus, if an utterance made by a user includes an unregistered word, the unregistered word is recognized as one of the words (registered words) registered in the dictionary, and thus the result of recognition of the unregistered word becomes wrong. If an unregistered word is recognized incorrectly, the incorrect recognition can influence recognition of a word prior to or subsequent to the unregistered word, that is, can cause such a word to be recognized incorrectly.

[0004] Therefore, it is required to properly deal with unregistered words so as to avoid the above problem. To this end, various techniques have been proposed.

[0005] For example, Japanese Unexamined Patent Application Publication No. 9-81181 discloses a voice recognition apparatus in which a garbage model for detecting an unregistered word and an HMM (Hidden Markov Model) associated with phonemes such as vowels are simultaneously used so as to limit phoneme sequences associated with the unregistered word, thereby making it possible to detect the unregistered word without needing complicated calculations.

[0006] As another example, Japanese Patent Application No. 11-245461 discloses an information processing apparatus in which when a word set including an unregistered word is given, the similarity between the unregistered word which is not included in a database and a word included in the database is calculated on the basis of the concepts of words, and a sequence of properly arranged words is produced and output.

[0007] As still another example, “Dictionary Learning: Performance Through Consistency” (Tilo Sloboda, Proceedings of ICASSP 95, vol. 1, pp. 453-456, 1995) discloses a technique in which phoneme sequences corresponding to voice periods of words are detected and phoneme sequences which are acoustically similar to each other are deleted using a confusion matrix, thereby effectively constructing a dictionary including variants.

[0008] As still another example, “Estimation of Transcription of Unknown Word from Speech Samples in Word Recognition” (Katsunobu Ito, et al., The Transactions of the Institute of Electronics, Information, and Communication Engineers, Vol. J83-D-II, No. 11, pp. 2152-2159, November, 2000) discloses a technique of improving estimation accuracy of a phoneme sequence when the phoneme sequence is estimated from a plurality of speech samples and an unknown (unregistered) word is registered in a dictionary.

[0009] One typical method for dealing with an unregistered word is, if an unregistered word is detected in an input voice, to register the unregistered word into a dictionary and treat it as a registered word thereafter.

[0010] In order to register an unregistered word into a dictionary, it is required to first detect a voice period of that unregistered word and then recognize the phoneme sequence of the voice in the voice period. The recognition of the phoneme sequence of a voice can be accomplished, for example, by a method known as a phoneme typewriter. In the phoneme typewriter, a phoneme sequence corresponding to an input voice is basically output using a garbage model which accepts any phonemic change.

[0011] When an unregistered word is registered into a dictionary, it is required to cluster the phoneme sequence of the unregistered word. That is, in the dictionary, the phoneme sequence of each word is registered in the form of a cluster corresponding to the word, and thus, to register an unregistered word into the dictionary, it is required to cluster the phoneme sequence of the unregistered word.

[0012] One method of clustering the phoneme sequence of an unregistered word is to have the user input an entry (for example, a pronunciation of the unregistered word) indicating the unregistered word and then cluster the phoneme sequence of the unregistered word into a cluster indicated by that entry. However, in this method, the user has to perform the troublesome task of inputting the entry.

[0013] Another method is to produce a new cluster each time an unregistered word is detected such that the phoneme sequence of the unregistered word is clustered into the newly produced cluster. However, in this method, an entry corresponding to the new cluster is registered into a dictionary each time an unregistered word is detected, and thus the size of the dictionary increases as unregistered words are registered. As a result, more time and a greater amount of processing are necessary in voice recognition performed thereafter.

DISCLOSURE OF INVENTION

[0014] In view of the above, it is an object of the present invention to provide a technique of easily registering an unregistered word into a dictionary without causing a significant increase in the size of the dictionary.

[0015] The present invention provides a voice recognition apparatus comprising cluster detection means for detecting, from existing clusters obtained by clustering voices, a cluster to which an input voice is to be added as a new member; cluster division means for employing the input voice as the new member of the cluster detected by the cluster detection means and dividing the cluster depending on members of the cluster; and update means for updating the dictionary on the basis of a result of division performed by the cluster division means.

[0016] The present invention provides a voice recognition method comprising the steps of detecting, from existing clusters obtained by clustering voices, a cluster to which the input voice is to be added as a new member; employing the input voice as the new member of the cluster detected in the cluster detection step and dividing the cluster depending on members of the cluster; and updating the dictionary on the basis of a result of division performed in the cluster division step.

[0017] The present invention provides a program comprising the steps of detecting, from existing clusters obtained by clustering voices, a cluster to which the input voice is to be added as a new member; employing the input voice as the new member of the cluster detected in the cluster detection step and dividing the cluster depending on members of the cluster; and updating the dictionary on the basis of a result of division performed in the cluster division step.

[0018] The present invention provides a storage medium including a program, stored therein, comprising the steps of detecting, from existing clusters obtained by clustering voices, a cluster to which the input voice is to be added as a new member; employing the input voice as the new member of the cluster detected in the cluster detection step and dividing the cluster depending on members of the cluster; and updating the dictionary on the basis of a result of division performed in the cluster division step.

[0019] In the present invention, from existing clusters obtained by clustering voices, a cluster to which an input voice is to be added as a new member is detected. The input voice is added as a new member to the detected cluster and the cluster is divided depending on the members of the cluster. In accordance with a result of the division, the dictionary is updated.

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] FIG. 1 is a perspective view showing an example of an outward structure of a robot according to an embodiment of the present invention.

[0021] FIG. 2 is a block diagram showing an example of an internal structure of the robot.

[0022] FIG. 3 is a block diagram showing an example of a functional structure of a controller of the robot shown in FIG. 1.

[0023] FIG. 4 is a block diagram showing an example of a construction of a voice recognition apparatus according to an embodiment of the present invention, wherein the voice recognition apparatus is used as a voice recognition unit of the robot shown in FIG. 1.

[0024] FIG. 5 is a diagram showing a word dictionary.

[0025] FIG. 6 is a diagram showing grammatical rules.

[0026] FIG. 7 is a diagram showing contents stored in a feature vector buffer of the voice recognition unit shown in FIG. 4.

[0027] FIG. 8 is a diagram showing a score sheet.

[0028] FIG. 9 is a flow chart showing a voice recognition process performed by the voice recognition unit shown in FIG. 4.

[0029] FIG. 10 is a flow chart showing the details of unregistered word processing shown in FIG. 9.

[0030] FIG. 11 is a flow chart showing the details of a cluster division process shown in FIG. 9.

[0031] FIG. 12 is a diagram showing a result of a simulation.

[0032] FIG. 13 is a diagram showing an example of a hardware configuration of a voice recognition apparatus according to a second embodiment of the present invention.

[0033] FIG. 14 is a diagram showing an example of a software configuration of the voice recognition apparatus shown in FIG. 13.

[0034] FIG. 15 is a diagram showing contents stored in a feature vector buffer of the voice recognition unit shown in FIG. 14.

[0035] FIG. 16 is a flow chart showing a voice recognition process performed by the voice recognition apparatus shown in FIG. 14.

[0036] FIG. 17 is a flow chart showing the details of an unregistered word deletion process shown in FIG. 16.

BEST MODE FOR CARRYING OUT THE INVENTION

[0037] FIG. 1 shows an example of an outward structure of a robot according to an embodiment of the present invention, and FIG. 2 shows an example of an electric configuration thereof.

[0038] In the present embodiment, the robot is constructed into the form of an animal having four legs, such as a dog, wherein leg units 3A, 3B, 3C, and 3D are attached, at respective four corners, to a body unit 2, and a head unit 4 and a tail unit 5 are attached, at front and back ends, to the body unit 2.

[0039] The tail unit 5 extends from a base 5B disposed on the upper surface of the body unit 2 such that the tail unit 5 can bend or shake with two degrees of freedom.

[0040] Inside the body unit 2, there are disposed a controller 10 for generally controlling the robot, a battery 11 serving as a power source of the robot, and an internal sensor unit 14 including a battery sensor 12 and a heat sensor 13.

[0041] On the head unit 4, there are disposed, at properly selected positions, a microphone 15 serving as an ear, a CCD (Charge Coupled Device) camera 16 serving as an eye, a touch sensor 17 serving as a sense-of-touch sensor, and a speaker 18 serving as a mouth. A lower jaw unit 4A serving as a lower jaw of the mouth is attached to the head unit 4 such that the lower jaw unit 4A can move with one degree of freedom. The mouth of the robot can be opened and closed by moving the lower jaw unit 4A.

[0042] As shown in FIG. 2, actuators 3AA₁ to 3AA_(K), 3BA₁ to 3BA_(K), 3CA₁ to 3CA_(K), 3DA₁ to 3DA_(K), 4A₁ to 4A_(L), 5A₁, and 5A₂ are respectively disposed in joints for joining parts of the leg units 3A to 3D, joints for joining the leg units 3A to 3D with the body unit 2, a joint for joining the head unit 4 with the body unit 2, a joint for joining the head unit 4 with the lower jaw unit 4A, and a joint for joining the tail unit 5 with the body unit 2.

[0043] The microphone 15 disposed on the head unit 4 collects a voice (sound) including an utterance made by a user from the environment and transmits an obtained voice signal to the controller 10. The CCD camera 16 takes an image of the environment and transmits an obtained image signal to the controller 10.

[0044] The touch sensor 17 is disposed on an upper part of the head unit 4. The touch sensor 17 detects a pressure applied by the user as a physical action such as “rubbing” or “tapping” and transmits a pressure signal obtained as the result of the detection to the controller 10.

[0045] The battery sensor 12 disposed in the body unit 2 detects the remaining capacity of the battery 11 and transmits the result of the detection as a battery remaining capacity signal to the controller 10.

[0046] The heat sensor 13 detects heat in the inside of the robot and transmits information indicating the detected heat, as a heat signal, to the controller 10.

[0047] The controller 10 includes a CPU (Central Processing Unit) 10A and a memory 10B. The controller 10 performs various processes by executing, using the CPU 10A, a control program stored in the memory 10B.

[0048] The controller 10 detects an environment state, a command given by a user, and an action of a user applied to the robot on the basis of the voice signal, the image signal, the pressure signal, the battery remaining capacity signal, and the heat signal supplied from the microphone 15, the CCD camera 16, the touch sensor 17, the battery sensor 12, and the heat sensor 13, respectively.

[0049] On the basis of the parameters detected above, the controller 10 makes a decision as to how to act. In accordance with the decision, the controller 10 activates the necessary actuators among the actuators 3AA₁ to 3AA_(K), 3BA₁ to 3BA_(K), 3CA₁ to 3CA_(K), 3DA₁ to 3DA_(K), 4A₁ to 4A_(L), 5A₁, and 5A₂, so as to nod or shake the head unit 4 or open and close the lower jaw unit 4A. Depending on the situation, the controller 10 moves the tail unit 5 or makes the robot walk by moving the leg units 3A to 3D.

[0050] Furthermore, as required, the controller 10 produces synthesized voice data and supplies it to the speaker 18, thereby generating a voice, or turns on/off or blinks LEDs (Light Emitting Diodes, not shown in the figures) disposed on the eyes of the robot.

[0051] As described above, the robot autonomously acts in response to the environmental conditions.

[0052] FIG. 3 shows the functional structure of the controller 10 shown in FIG. 2. Note that the functional structure shown in FIG. 3 is realized by executing, using the CPU 10A, the control program stored in the memory 10B.

[0053] The controller 10 includes a sensor input processing unit 50 for recognizing a specific external state, a model memory 51 for storing results of recognition performed by the sensor input processing unit 50 and representing a state of emotion, instinct, or growth, an action decision unit 52 for determining an action to be taken next in accordance with the result of recognition performed by the sensor input processing unit 50, an attitude changing unit 53 for making the robot actually take an action in accordance with the decision made by the action decision unit 52, a control unit 54 for controlling actuators 3AA₁ to 5A₁ and 5A₂, and a voice synthesis unit 55 for producing a synthesized voice.

[0054] The sensor input processing unit 50 detects specific external conditions, an action of a user applied to the robot, and a command given by the user, on the basis of the voice signal, the image signal, and the pressure signal supplied from the microphone 15, the CCD camera 16, and the touch sensor 17, respectively. Information indicating the detected conditions is supplied as state recognition information to the model memory 51 and the action decision unit 52.

[0055] More specifically, the sensor input processing unit 50 includes a voice recognition unit 50A for recognizing the voice signal supplied from the microphone 15. For example, if a given voice signal is recognized by the voice recognition unit 50A as a command such as “walk”, “lie down”, or “follow the ball”, the recognized command is supplied as state recognition information from the voice recognition unit 50A to the model memory 51 and the action decision unit 52.

[0056] The sensor input processing unit 50 also includes an image recognition unit 50B for recognizing an image signal supplied from the CCD camera 16. For example, if the image recognition unit 50B detects, via the image recognition process, “something red and round” or a “plane extending vertically from the ground to a height greater than a predetermined value”, then the image recognition unit 50B supplies information indicating the recognized state of the environment such as “there is a ball” or “there is a wall” as state recognition information to the model memory 51 and the action decision unit 52.

[0057] The sensor input processing unit 50 further includes a pressure processing unit 50C for processing a detected pressure signal supplied from the touch sensor 17. For example, if the pressure processing unit 50C detects a pressure higher than a predetermined threshold for a short duration, the pressure processing unit 50C recognizes that the robot has been “tapped (scolded)”. In a case in which the detected pressure is lower in magnitude than a predetermined threshold and long in duration, the pressure processing unit 50C recognizes that the robot has been “rubbed (praised)”. Information indicating the result of recognition is supplied as state recognition information to the model memory 51 and the action decision unit 52.
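The tapped/rubbed distinction made by the pressure processing unit 50C can be sketched as a simple threshold test. The following is only an illustration: the function name, the threshold values, and the use of a single pressure threshold for both cases are assumptions, since the description says only that the thresholds are "predetermined".

```python
from typing import Optional

# Hypothetical values; the description only calls them "predetermined".
PRESSURE_THRESHOLD = 0.5   # arbitrary pressure units (assumption)
DURATION_THRESHOLD = 0.3   # seconds separating "short" from "long" (assumption)


def classify_touch(pressure: float, duration: float) -> Optional[str]:
    """Return "tapped", "rubbed", or None when neither pattern matches."""
    if pressure > PRESSURE_THRESHOLD and duration < DURATION_THRESHOLD:
        return "tapped"   # strong and short: interpreted as being scolded
    if pressure < PRESSURE_THRESHOLD and duration >= DURATION_THRESHOLD:
        return "rubbed"   # weak and long: interpreted as being praised
    return None           # the description does not cover the other cases
```

For instance, `classify_touch(0.9, 0.1)` falls in the "tapped" branch, while a weak, sustained contact such as `classify_touch(0.2, 1.0)` falls in the "rubbed" branch.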

[0058] The model memory 51 stores and manages an emotion model, an instinct model, and a growth model representing the states of the robot concerning emotion, instinct, and growth, respectively.

[0059] The emotion model represents the state (degree) of emotion concerning, for example, “happiness”, “sadness”, “angriness”, and “pleasure” using values within predetermined ranges (for example, from −1.0 to 1.0), wherein the values are varied depending on the state recognition information supplied from the sensor input processing unit 50 and depending on the passage of time. The instinct model represents the state (degree) of instinct concerning, for example, “appetite”, “desire for sleep”, and “desire for exercise” using values within predetermined ranges, wherein the values are varied depending on the state recognition information supplied from the sensor input processing unit 50 and depending on the passage of time. The growth model represents the state (degree) of growth, such as “childhood”, “youth”, “middle age” and “old age” using values within predetermined ranges, wherein the values are varied depending on the state recognition information supplied from the sensor input processing unit 50 and depending on the passage of time.

[0060] The states of emotion, instinct, and growth, represented by values of the emotion model, the instinct model, and the growth model, respectively, are supplied as state information from the model memory 51 to the action decision unit 52.

[0061] In addition to the state recognition information supplied from the sensor input processing unit 50, the model memory 51 also receives, from the action decision unit 52, action information indicating a current or past action of the robot, such as “walked for a long time”, thereby allowing the model memory 51 to produce different state information for the same state recognition information, depending on the robot's action indicated by the action information.

[0062] More specifically, for example, when the robot greets the user, if the user rubs the head of the robot, then action information indicating that the robot greeted the user and state recognition information indicating that the head was rubbed are supplied to the model memory 51. In this case, in response, the model memory 51 increases the value of the emotion model indicating the degree of “happiness”.

[0063] On the other hand, if the robot is rubbed on the head when the robot is doing a job, action information indicating that the robot is doing a job and state recognition information indicating that the head was rubbed are supplied to the model memory 51. In this case, the model memory 51 does not increase the value of the emotion model indicating the degree of “happiness”.

[0064] As described above, the model memory 51 sets the values of the emotion model on the basis of not only the state recognition information but also the action information indicating the current or past action of the robot.

[0065] This prevents the robot from having an unnatural change in emotion. For example, even if the user rubs the head of the robot with the intention of playing a trick on the robot when the robot is doing some task, the value of the emotion model associated with “happiness” is not increased unnaturally.
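The dependence of the “happiness” value on both the state recognition information and the action information can be illustrated with a small sketch. The string labels, the step size, and the function names below are hypothetical; only the example range from −1.0 to 1.0 and the greeting/working behavior come from the description.

```python
def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Keep an emotion value inside the example range given in the text."""
    return max(low, min(high, value))


def update_happiness(happiness: float, recognized_state: str,
                     current_action: str, step: float = 0.1) -> float:
    """Raise happiness only when the head is rubbed while greeting the user.

    The labels "head_rubbed" and "greeting" and the step size are
    assumptions made for illustration.
    """
    if recognized_state == "head_rubbed" and current_action == "greeting":
        return clamp(happiness + step)
    # Same recognition result during another action (e.g. a job): unchanged.
    return happiness
```

With these assumptions, the same "head_rubbed" input raises the value during a greeting but leaves it unchanged while the robot is working, which mirrors the two cases described above.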

[0066] For the instinct model and the growth model, the model memory 51 also increases or decreases the values on the basis of both the state recognition information and the action information, as for the emotion model. Furthermore, when the model memory 51 increases or decreases a value of one of the emotion model, the instinct model, and the growth model, the values of the other models are taken into account.

[0067] The action decision unit 52 decides an action to be taken next on the basis of the state recognition information supplied from the sensor input processing unit 50, the state information supplied from the model memory 51, and the passage of time. The content of the decided action is supplied as action command information to the attitude changing unit 53.

[0068] More specifically, the action decision unit 52 manages a finite automaton, which can take states corresponding to the possible actions of the robot, as an action model which determines the action of the robot. The state of the finite automaton serving as the action model is changed depending on the state recognition information supplied from the sensor input processing unit 50, the values of the model memory 51 associated with the emotion model, the instinct model, and the growth model, and the passage of time, and the action decision unit 52 employs the action corresponding to the changed state as the action to be taken next.

[0069] In the above process, when the action decision unit 52 detects a particular trigger, the action decision unit 52 changes the state. More specifically, the action decision unit 52 changes the state, for example, when the period of time in which the action corresponding to the current state has been performed has reached a predetermined value, or when specific state recognition information has been received, or when the value of the state of the emotion, instinct, or growth indicated by the state information supplied from the model memory 51 becomes lower or higher than a predetermined threshold.

[0070] Because, as described above, the action decision unit 52 changes the state of the action model not only depending on the state recognition information supplied from the sensor input processing unit 50 but also depending on the values of the emotion model, the instinct model, and the growth model of the model memory 51, the state to which the current state is changed can be different depending on the values (state information) of the emotion model, the instinct model, and the growth model even when the same state recognition information is input.

[0071] For example, when the state information indicates that the robot is not “angry” and is not “hungry”, if the state recognition information indicates that “a user's hand with its palm facing up is held in front of the face of the robot”, the action decision unit 52 produces, in response to the hand being held in front of the face of the robot, action command information indicating that shaking should be performed and the action decision unit 52 transmits the produced action command information to the attitude changing unit 53.

[0072] On the other hand, for example, when the state information indicates that the robot is not “angry” but “hungry”, if the state recognition information indicates that “a user's hand with its palm facing up is held in front of the face of the robot”, the action decision unit 52 produces, in response to the hand being held in front of the face of the robot, action command information indicating that the robot should lick the palm of the hand and the action decision unit 52 transmits the produced action command information to the attitude changing unit 53.

[0073] When the state information indicates that the robot is “angry”, if the state recognition information indicates that “a user's hand with its palm facing up is held in front of the face of the robot”, the action decision unit 52 produces action command information indicating that the robot should turn its face aside regardless of whether the state information indicates that the robot is or is not “hungry”, and the action decision unit 52 transmits the produced action command information to the attitude changing unit 53.
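The three cases above amount to a small decision table in which the same state recognition information yields different action commands depending on the state information. A minimal sketch follows, assuming boolean "angry" and "hungry" flags and hypothetical string labels for the recognized state and the resulting commands.

```python
from typing import Optional

# Hypothetical label for the recognized state described in the text.
PALM_UP = "palm_up_hand_in_front_of_face"


def decide_action(angry: bool, hungry: bool,
                  recognized_state: str) -> Optional[str]:
    """Map state information plus state recognition information to a command."""
    if recognized_state != PALM_UP:
        return None                  # other inputs are outside this sketch
    if angry:
        return "turn_face_aside"     # chosen whether or not the robot is hungry
    if hungry:
        return "lick_palm"
    return "shake"
```

Checking "angry" first reflects the statement that the robot turns its face aside regardless of hunger; the order of the remaining tests then follows directly.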

[0074] In addition to the above-described action command information associated with motions of various parts of the robot such as the head, hand, legs, etc., the action decision unit 52 also produces action command information for causing the robot to utter. The action command information for causing the robot to utter is supplied to the voice synthesis unit 55. The action command information supplied to the voice synthesis unit 55 includes a text or the like corresponding to a voice to be synthesized by the voice synthesis unit 55. If the voice synthesis unit 55 receives the action command information from the action decision unit 52, the voice synthesis unit 55 produces a synthesized voice in accordance with the text included in the action command information and supplies it to the speaker 18, which in turn outputs the synthesized voice.

[0075] Thus, the speaker 18 outputs a voice of a cry, a voice “I am hungry” to request something from the user, or a voice “What?” to respond to a call from the user. When a synthesized voice is output, the action decision unit 52 produces action command information to open and close the lower jaw unit 4A as required and supplies it to the attitude changing unit 53. In response, the lower jaw unit 4A is opened and closed in synchronization with outputting of the synthesized voice. This can give the user an impression that the robot is actually speaking.

[0076] In accordance with the action command information supplied from the action decision unit 52, the attitude changing unit 53 produces attitude change command information for changing the attitude of the robot from the current attitude to a next attitude and transmits it to the control unit 54.

[0077] In accordance with the attitude change command information received from the attitude changing unit 53, the control unit 54 produces a control signal for driving the actuators 3AA₁ to 5A₁ and 5A₂ and transmits it to the actuators 3AA₁ to 5A₁ and 5A₂. Thus, in accordance with the control signal, the actuators 3AA₁ to 5A₁ and 5A₂ are driven such that the robot acts autonomously.

[0078] FIG. 4 shows an example of a construction of the voice recognition unit 50A shown in FIG. 3.

[0079] A voice signal output from the microphone 15 is supplied to an AD (Analog Digital) converter 21. The AD converter 21 samples and quantizes the voice signal in the form of an analog signal supplied from the microphone 15 so as to convert it into voice data in the form of a digital signal. The resultant voice data is supplied to a feature extraction unit 22.
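Sampling and quantization as performed by the AD converter 21 can be illustrated in software as follows. This is a sketch under stated assumptions: the analog signal is modeled as a function of time with values in [−1, 1], and the 16 kHz sampling rate and 16-bit depth are illustrative defaults that are not given in the description.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz=16000, bits=16):
    """Sample signal(t) at a fixed rate and quantize to signed integers.

    `signal` is a hypothetical callable returning a value in [-1.0, 1.0];
    the default rate and bit depth are assumptions.
    """
    n_samples = round(duration_s * sample_rate_hz)
    max_code = 2 ** (bits - 1) - 1          # e.g. 32767 for 16 bits
    voice_data = []
    for i in range(n_samples):
        x = signal(i / sample_rate_hz)      # sampling: evaluate at t = i / fs
        x = max(-1.0, min(1.0, x))          # clip to the converter's input range
        voice_data.append(round(x * max_code))  # quantization to integer codes
    return voice_data


# Example: 10 ms of a 440 Hz tone.
tone = sample_and_quantize(lambda t: math.sin(2 * math.pi * 440 * t), 0.01)
```

The resulting list of integers plays the role of the voice data passed on to the feature extraction unit 22.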

[0080] The feature extraction unit 22 performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis on voice data input thereto on a frame-by-frame basis, wherein units of frames are properly selected. The MFCCs obtained via the analysis are output in the form of a feature vector (feature parameters) to a matching unit 23 and an unregistered word period processing unit 27. Alternatively, the feature extraction unit 22 may extract, as a feature vector, linear prediction coefficients, cepstrum coefficients, a line spectrum pair, or power for each of particular frequency bands (output from filter banks).
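Frame-by-frame analysis presupposes that the voice data is first cut into frames. The sketch below shows only this framing step, not the MFCC computation itself. The frame length of 400 samples and shift of 160 samples (25 ms and 10 ms at an assumed 16 kHz rate) are common choices used here as assumptions; the description says only that the frame units are properly selected.

```python
def split_into_frames(samples, frame_length=400, frame_shift=160):
    """Cut a sample sequence into overlapping fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped, which is
    one of several possible conventions (assumption).
    """
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames
```

Each returned frame would then be passed to the chosen analysis (MFCC or one of the alternatives listed above) to yield one feature vector per frame.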

[0081] Using the feature vector supplied from the feature extraction unit 22 and referring, as required, to the acoustic model memory 24, the dictionary memory 25, and the grammar memory 26, the matching unit 23 recognizes the voice (input voice) input to the microphone 15 on the basis of, for example, a continuous density HMM (Hidden Markov Model).

[0082] The acoustic model memory 24 stores an acoustic model representing acoustic features of respective subwords such as phonemes and syllables in a language of the voice to be recognized (wherein the acoustic model may include, in addition to the HMM, standard patterns used in DP (Dynamic Programming) matching). In the present embodiment, voice recognition is performed on the basis of the continuous density HMM method, and thus the HMM (Hidden Markov Model) is used as the acoustic model.

[0083] The dictionary memory 25 stores a dictionary in which words to be recognized are described in such a manner that the entry of each word is related to information (phonemic information) indicating the pronunciation of that word in the form of a cluster of phonemes.

[0084] FIG. 5 shows the word dictionary stored in the dictionary memory 25.

[0085] In the word dictionary, as shown in FIG. 5, entries of respective words are related to corresponding phoneme sequences, wherein the phoneme sequences are given in the form of clusters corresponding to the respective words. In the word dictionary shown in FIG. 5, one entry (one row in FIG. 5) corresponds to one cluster.

[0086] In the example shown in FIG. 5, each entry is represented in Roman characters and also in Japanese characters (kanji and kana characters), and each phoneme sequence is represented in Roman characters. (In this example, the Japanese words of the entries are pronounced “boku”, “chigau”, “doko”, “genki”, “iro”, “janai”, “kirai”, and “kudasai”, and correspond to English words “I”, “different”, “where”, “fine”, “color”, “not”, “dislike”, and “please”, respectively.) In the representations of the phoneme sequences, “N” indicates a syllabic nasal corresponding to the Japanese phonetic symbol “ん”. Although in the example shown in FIG. 5, one phoneme sequence is described in association with each entry, a plurality of phoneme sequences may be described in association with one entry.
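The word dictionary described above can be pictured as a mapping from each entry to a cluster, that is, a list of one or more phoneme sequences. The sketch below is illustrative only: the segmentation of each word into phoneme symbols is an assumption and is not copied from FIG. 5, and the helper name is hypothetical.

```python
# Entry (Roman characters) -> cluster of phoneme sequences.
# The phoneme segmentations are assumptions made for illustration.
word_dictionary = {
    "boku": [["b", "o", "k", "u"]],
    "genki": [["g", "e", "N", "k", "i"]],   # "N" marks the syllabic nasal
    "iro": [["i", "r", "o"]],
}


def add_pronunciation(dictionary, entry, phonemes):
    """Add a phoneme sequence to an entry's cluster, creating it if needed."""
    cluster = dictionary.setdefault(entry, [])
    if list(phonemes) not in cluster:       # avoid storing exact duplicates
        cluster.append(list(phonemes))


# An entry may hold several phoneme sequences, for example a variant.
add_pronunciation(word_dictionary, "genki", ["g", "e", "N", "k", "i", "i"])
```

Representing each cluster as a list keeps the one-row-per-entry layout while allowing several phoneme sequences per entry.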

[0087] Referring again to FIG. 4, the grammar memory 26 storesgrammatical rules indicating how those words registered in the worddictionary stored in the dictionary memory 25 are concatenated(connected).

[0088]FIG. 6 shows grammatical rules stored in the grammar memory 26.The grammatical rules shown in FIG. 6 are described in an EBNF (ExtendedBackus Naur Form).

[0089] In FIG. 6, one grammatical rule is described from the beginningof each row to a point at which “;” first appears. Each alphabeticcharacter (character string) prefixed with “$” denotes a variable, andeach alphabetic character (character string) with no prefix of “$”represents an entry (corresponding to an entry described in Romancharacters in FIG. 5) of a word. Square brackets [ ] are used toindicate that a part enclosed therein can be omitted. “|” is used toindicate that either one of words (or variables) located before andafter “|” of an entry can be selected.

[0090] More specifically, for example, in the first row of FIG. 6, the grammatical rule “$col=[kono|sono] iro wa;” indicates that variable $col is the word string “kono iro wa” or “sono iro wa” (or, because the bracketed part can be omitted, simply “iro wa”). (Herein, “kono” and “sono” are Japanese words corresponding to the English words “this” and “that”, respectively, and “wa” is a particle in the Japanese language having no corresponding English word.)
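As a small illustration of how such a rule expands into word strings, the rule for $col can be modeled by hand as a list of slots, each holding its alternatives. This sketch does not parse EBNF text; the slot representation and the names `expand` and `col_rule` are assumptions, and the empty string stands for the omitted bracketed part.

```python
from itertools import product


def expand(rule):
    """Yield every word string a rule accepts.

    rule is a list of slots; each slot lists its alternatives, and an
    empty string marks an optional part that is left out.
    """
    for choice in product(*rule):
        yield " ".join(word for word in choice if word)


# "$col=[kono|sono] iro wa;" modeled by hand as three slots.
col_rule = [["kono", "sono", ""], ["iro"], ["wa"]]
col_strings = list(expand(col_rule))
```

Under these assumptions `col_strings` holds the strings with “kono”, with “sono”, and with the bracketed part omitted.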

[0091] Although variables $sil and $garbage are not defined in the grammatical rules shown in FIG. 6, variable $sil denotes a silent acoustic model (silent model), and variable $garbage denotes a garbage model which basically allows an arbitrary transition among phonemes.

[0092] Referring again to FIG. 4, the matching unit 23 creates an acoustic model of a word (word model) by concatenating acoustic models stored in the acoustic model memory 24 in accordance with the word dictionary stored in the dictionary memory 25. Furthermore, the matching unit 23 concatenates word models in accordance with the grammatical rules stored in the grammar memory 26, and recognizes the voice input to the microphone 15 using the concatenated word models and the feature vectors in accordance with the continuous density HMM method.

[0093] More specifically, the matching unit 23 detects the sequence of word models having the highest score (likelihood) indicating the probability that the time sequence of feature vectors output from the feature extraction unit 22 is observed, and the sequence of entries corresponding to that sequence of word models is output as the result of voice recognition.

[0094] More specifically, the matching unit 23 determines the score for a sequence of word models by calculating the sum of the occurrence probabilities (output probabilities) of the feature vectors for the sequence of words corresponding to the concatenated word models, and the matching unit 23 outputs, as the result of voice recognition, the sequence of entries corresponding to the sequence of word models having the highest score.

[0095] The result of the voice recognition on the voice input to the microphone 15 is output as state recognition information to the model memory 51 and the action decision unit 52.

[0096] In the example shown in FIG. 6, a grammatical rule using variable $garbage indicating a garbage model (hereinafter, such a grammatical rule will be referred to as a grammatical rule including an unregistered word), “$part1=$color1 $garbage $color2;”, is described in the ninth row (as counted from the top row). Thus, when this grammatical rule including an unregistered word is applied, the matching unit 23 detects the voice period corresponding to variable $garbage as the voice period of an unregistered word. The matching unit 23 then detects the phoneme sequence of the unregistered word by detecting the phoneme sequence corresponding to variable $garbage described in the grammatical rule including an unregistered word, wherein variable $garbage accepts arbitrary phonemic transitions. The voice period and the phoneme sequence of the unregistered word detected as the result of the voice recognition according to the grammatical rule including an unregistered word are supplied from the matching unit 23 to the unregistered word period processing unit 27.

[0097] In the example described above, when the grammatical rule including an unregistered word, “$part1=$color1 $garbage $color2;”, is applied, one unregistered word is detected between the phoneme sequence of a word (word sequence) indicated by variable $color1 and registered in the word dictionary and the phoneme sequence of a word (word sequence) indicated by variable $color2 and registered in the word dictionary. However, the present invention can also be applied to a case in which an utterance includes a plurality of unregistered words or to a case in which an unregistered word is not located between words (word sequences) registered in the word dictionary.

[0098] The unregistered word period processing unit 27 temporarily stores the sequence of feature vectors (feature vector sequence) supplied from the feature extraction unit 22. If the unregistered word period processing unit 27 receives the voice period and the phoneme sequence of the unregistered word from the matching unit 23, the unregistered word period processing unit 27 detects the feature vector sequence of the voice in that voice period from the temporarily stored feature vector sequence. The unregistered word period processing unit 27 assigns a unique ID (identification) to the phoneme sequence (unregistered word) supplied from the matching unit 23 and supplies the ID, together with the phoneme sequence of the unregistered word and the feature vector sequence of the voice period, to the feature vector buffer 28.

[0099] The feature vector buffer 28 temporarily stores the ID, the phoneme sequence, and the feature vector sequence of the unregistered word, supplied from the unregistered word period processing unit 27, such that they are related to each other as shown in FIG. 7.

[0100] In FIG. 7, sequential numerals starting from 1 are assigned as IDs to the respective unregistered words. Therefore, for example, when the feature vector buffer 28 includes IDs, phoneme sequences, and feature vector sequences for N unregistered words already stored therein, if the matching unit 23 detects the voice period and the phoneme sequence of another unregistered word, the unregistered word period processing unit 27 assigns N+1 as the ID of the detected unregistered word, and the ID, the phoneme sequence, and the feature vector sequence of that unregistered word are stored into the feature vector buffer 28 as represented by broken lines in FIG. 7.
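The ID assignment described above can be sketched as follows. The class `FeatureVectorBuffer` and its `store` method are hypothetical names chosen for the illustration, and for brevity the sketch lets the buffer assign the ID, whereas in the apparatus the ID is assigned by the unregistered word period processing unit 27.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureVectorBuffer:
    """Relates each ID to a phoneme sequence and a feature vector sequence."""
    records: dict = field(default_factory=dict)

    def store(self, phonemes, feature_vectors):
        """Store a newly detected unregistered word under the next sequential ID (N+1)."""
        new_id = len(self.records) + 1
        self.records[new_id] = (list(phonemes), list(feature_vectors))
        return new_id


buffer = FeatureVectorBuffer()
first_id = buffer.store(["d", "o", "k", "o"], [[0.1, 0.2], [0.3, 0.4]])
second_id = buffer.store(["t", "o", "k", "o"], [[0.2, 0.1]])
```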

[0101] Referring again to FIG. 4, the clustering unit 29 calculates the scores for the unregistered word newly stored in the feature vector buffer 28 (hereinafter, such an unregistered word will be referred to as a new unregistered word) with respect to the other unregistered words already stored in the feature vector buffer 28 (hereinafter, such an unregistered word will be referred to as an already-stored unregistered word).

[0102] That is, the clustering unit 29 regards the new unregistered word as an input voice and the already-stored unregistered words as words registered in the word dictionary, and the clustering unit 29 calculates the scores of the new unregistered word with respect to the respective already-stored unregistered words, in a manner similar to that of the matching unit 23. More specifically, the clustering unit 29 detects the feature vector sequence of the new unregistered word by referring to the feature vector buffer 28, and concatenates acoustic models in accordance with the phoneme sequences of the already-stored unregistered words. The clustering unit 29 then calculates the scores indicating the likelihood that the feature vector sequence of the new unregistered word is observed in the concatenated acoustic models.

[0103] As for the acoustic models used in the above process, those stored in the acoustic model memory 24 are employed.

[0104] Similarly, the clustering unit 29 calculates the score of each already-stored unregistered word with respect to the new unregistered word and updates the score sheet stored in the score sheet memory 30 using the calculated scores.

[0105] The clustering unit 29 refers to the updated score sheet and detects, from the existing clusters obtained by clustering unregistered words (already-stored unregistered words), a cluster to which the new unregistered word is to be added as a new member. The clustering unit 29 adds the new unregistered word as a new member to the detected cluster and divides that cluster depending on the members of that cluster. The clustering unit 29 then updates the score sheet stored in the score sheet memory 30 on the basis of the result of the division.

[0106] The score sheet memory 30 stores the score sheet in which the scores of the new unregistered word with respect to the already-stored unregistered words and the scores of the already-stored unregistered words with respect to the new unregistered word are registered.

[0107] FIG. 8 shows the score sheet.

[0108] The score sheet includes entries in which the “ID”, “phoneme sequence”, “cluster number”, “representative member ID”, and “score” of each unregistered word are described.

[0109] For each unregistered word, the same “ID” and “phoneme sequence” as those stored in the feature vector buffer 28 are registered by the clustering unit 29 into the score sheet. The “cluster number” is a numeral identifying the cluster including, as a member, the unregistered word of an entry, wherein the cluster number is assigned by the clustering unit 29 and registered in the score sheet. The “representative member ID” is the ID of the unregistered word employed as the representative member of the cluster including, as a member, the unregistered word of the entry. On the basis of a representative member ID, it is possible to identify the representative member of the cluster including, as a member, an unregistered word. The representative member of a cluster is determined by the clustering unit 29, and the ID of the representative member is registered in the representative member ID field in the score sheet. The “score” is the score of the unregistered word of an entry with respect to each of the other unregistered words. As described earlier, the score is calculated by the clustering unit 29.

[0110] For example, if the IDs, phoneme sequences, and feature vector sequences of N unregistered words are currently stored in the feature vector buffer 28, then the score sheet includes the IDs, phoneme sequences, cluster numbers, representative member IDs, and scores of those N unregistered words.

[0111] When the ID, the phoneme sequence, and the feature vector sequence of the new unregistered word are newly stored into the feature vector buffer 28, the clustering unit 29 updates the score sheet as represented by broken lines in FIG. 8.

[0112] More specifically, the ID, the phoneme sequence, the cluster number, and the representative member ID of the new unregistered word, and also the scores (s(N+1, 1), s(N+1, 2), . . . , s(N+1, N) shown in FIG. 8) of the new unregistered word with respect to the already-stored unregistered words are added to the score sheet. Furthermore, the scores (s(1, N+1), s(2, N+1), . . . , s(N, N+1) shown in FIG. 8) of the respective already-stored unregistered words with respect to the new unregistered word are added to the score sheet. Thereafter, as will be described later, the cluster numbers and representative member IDs described in the score sheet are updated as required.

[0113] In the example shown in FIG. 8, s(i, j) denotes the score of the unregistered word (utterance of an unregistered word) having an ID of i with respect to the unregistered word (phoneme sequence of an unregistered word) having an ID of j.

[0114] In the score sheet (FIG. 8), the score s(i, i) of the unregistered word (utterance of the unregistered word) having an ID of i with respect to the unregistered word (phoneme sequence of the unregistered word) having the same ID of i is also registered. This score s(i, i) is calculated by the matching unit 23 when the matching unit 23 detects the phoneme sequence of the unregistered word, and thus the clustering unit 29 does not need to calculate this score.
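The growth of the score sheet by one row and one column per new unregistered word can be sketched as below. The dictionary layout, the function name `add_to_score_sheet`, and the callable `score_fn` are assumptions for illustration; `score_fn` merely stands in for the acoustic score calculation of the clustering unit 29, and the toy scores are not likelihoods.

```python
def add_to_score_sheet(sheet, new_id, phonemes, self_score, score_fn):
    """Add the row and column of a new unregistered word to the score sheet.

    sheet maps ID -> {"phonemes", "cluster", "rep", "scores"}, where
    scores[j] holds s(i, j). score_fn(i, j) is a hypothetical stand-in
    for the acoustic scoring performed by the clustering unit 29.
    """
    row = {"phonemes": phonemes, "cluster": None, "rep": None,
           "scores": {new_id: self_score}}  # s(N+1, N+1) comes from the matching unit 23
    for old_id, entry in sheet.items():
        row["scores"][old_id] = score_fn(new_id, old_id)    # s(N+1, j)
        entry["scores"][new_id] = score_fn(old_id, new_id)  # s(i, N+1)
    sheet[new_id] = row


def toy_score(i, j):
    """Toy score that depends only on the IDs; not an acoustic likelihood."""
    return 1.0 / (1 + abs(i - j))


sheet = {}
for word_id, phonemes in enumerate(["doko", "toko", "iro"], start=1):
    add_to_score_sheet(sheet, word_id, phonemes, 1.0, toy_score)
```

The cluster number and representative member ID are left empty here because they are filled in by the later clustering steps.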

[0115] Referring again to FIG. 4, the maintenance unit 31 updates the word dictionary stored in the dictionary memory 25 on the basis of the updated score sheet stored in the score sheet memory 30.

[0116] Herein, the representative member of a cluster is determined as follows. For example, of the unregistered words included as members in a cluster, the unregistered word which is highest in the sum of scores with respect to the other unregistered words (or which is highest in the mean score obtained by dividing the sum of scores by the number of the other unregistered words) is selected as the representative member of that cluster. That is, if the ID of a member belonging to a cluster is denoted by k, the member having the ID K (∈ k) given by the following equation is selected as the representative member.

$K = \max_{k}\left\{ \sum_{k'} s(k', k) \right\}$  (1)

[0117] In equation (1), max_k{ } denotes the k which gives the maximum value of the expression enclosed in braces { }, k′ denotes, as with k, an ID of a member of the cluster, and Σ denotes the sum taken by changing k′ over all IDs of the cluster.

[0118] In the process of determining the representative member in the above-described manner, if a cluster includes one or two unregistered words as its members, the representative member can be determined without having to calculate the scores. That is, in a case in which a cluster includes only one unregistered word as a member, this one unregistered word is selected as the representative member. On the other hand, when a cluster includes only two unregistered words as members, either one of the two unregistered words may be selected as the representative member.
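A minimal sketch of the selection according to equation (1), including the shortcut for clusters with one or two members, is given below. The function name `select_representative` and the representation of the scores as a dictionary keyed by (k′, k) are assumptions; the sum runs over all member IDs, as stated for equation (1), and the cluster is assumed to be non-empty.

```python
def select_representative(members, s):
    """Return the member ID K selected by equation (1).

    members: IDs of the unregistered words in one (non-empty) cluster.
    s: mapping (k_prime, k) -> score s(k', k) taken from the score sheet.
    """
    members = list(members)
    if len(members) <= 2:
        # One member: it is the representative. Two members: either may
        # be chosen, so the first is taken without calculating scores.
        return members[0]
    return max(members, key=lambda k: sum(s[(kp, k)] for kp in members))


# Toy scores: utterances 1 and 3 both score well against phoneme sequence 2.
scores = {(i, j): 1.0 if i == j else 0.2 for i in (1, 2, 3) for j in (1, 2, 3)}
scores[(1, 2)] = scores[(3, 2)] = 0.8
rep = select_representative([1, 2, 3], scores)
```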

[0119] The method of determining the representative member is not limited to that described above. For example, of the unregistered words included as members in a cluster, the unregistered word which is smallest in the sum of distances in the feature vector space from that unregistered word to the other unregistered words may be selected as the representative member of that cluster.

[0120] The voice recognition unit 50A constructed in the above-described manner performs a voice recognition process for recognizing a voice input to the microphone 15 and performs unregistered word processing on unregistered words.

[0121] First, with reference to the flow chart shown in FIG. 9, the voice recognition process is described.

[0122] When a user speaks, the uttered voice is input to the microphone 15 and converted into digital voice data by the AD converter 21. The resultant digital voice data is supplied to the feature extraction unit 22. In step S1, the feature extraction unit 22 performs acoustic analysis on the received voice data on a frame-by-frame basis, wherein the frame units are suitably determined. The feature extraction unit 22 extracts feature vectors via this acoustic analysis and supplies the obtained feature vector sequence to the matching unit 23 and the unregistered word period processing unit 27.

[0123] In step S2, the matching unit 23 calculates the scores for the feature vector sequence supplied from the feature extraction unit 22, in the above-described manner. Thereafter, the process proceeds to step S3. In step S3, on the basis of the scores obtained via the above score calculation, the matching unit 23 determines the sequence of entries of words indicating the result of voice recognition and outputs the resultant sequence of entries of words.

[0124] In the next step S4, the matching unit 23 determines whether the voice uttered by the user includes an unregistered word.

[0125] If it is determined in step S4 that the voice uttered by the user includes no unregistered word, that is, if the voice recognition result is obtained without applying the grammatical rule including an unregistered word, “$pat1=$color1 $garbage $color2;”, the process is ended without performing step S5.

[0126] On the other hand, if it is determined in step S4 that the voice uttered by the user includes an unregistered word, that is, if the voice recognition result is obtained by applying the grammatical rule including an unregistered word, “$pat1=$color1 $garbage $color2;”, the process proceeds to step S5. In step S5, the matching unit 23 detects the voice period of the unregistered word by detecting the voice period corresponding to variable $garbage in the grammatical rule including an unregistered word. The matching unit 23 also detects the phoneme sequence of the unregistered word by detecting the phoneme sequence corresponding to the garbage model indicated by variable $garbage, wherein the garbage model allows an arbitrary phonemic transition. The resultant voice period and phoneme sequence of the unregistered word are supplied to the unregistered word period processing unit 27, and the process is ended.

[0127] The feature vector sequence output from the feature extraction unit 22 is temporarily stored in the unregistered word period processing unit 27. If the unregistered word period processing unit 27 receives the voice period and the phoneme sequence of the unregistered word from the matching unit 23, the unregistered word period processing unit 27 detects the feature vector sequence of the voice in that voice period. Furthermore, the unregistered word period processing unit 27 assigns an ID to the unregistered word (phoneme sequence of the unregistered word) supplied from the matching unit 23 and supplies the ID, together with the phoneme sequence of the unregistered word and the feature vector sequence of the voice period, to the feature vector buffer 28.

[0128] After the ID, the phoneme sequence, and the feature vector sequence of the new unregistered word are stored in the feature vector buffer 28 in the above-described manner, unregistered word processing is performed.

[0129] FIG. 10 is a flow chart showing the unregistered word processing.

[0130] In the first step S11 of the unregistered word processing, the clustering unit 29 reads the ID and the phoneme sequence of the new unregistered word from the feature vector buffer 28. Thereafter, the process proceeds to step S12.

[0131] In step S12, the clustering unit 29 checks the score sheet stored in the score sheet memory 30 to determine whether there is a cluster which has already been obtained (produced).

[0132] If it is determined in step S12 that there is no cluster which has already been obtained, that is, if the new unregistered word is the first unregistered word and the score sheet includes no entry of an already-stored unregistered word, then the process proceeds to step S13. In step S13, the clustering unit 29 newly produces a cluster that includes the new unregistered word as its representative member, and the clustering unit 29 registers information associated with the new cluster and the new unregistered word in the score sheet stored in the score sheet memory 30, thereby updating the score sheet.

[0133] That is, the clustering unit 29 registers, into the score sheet (FIG. 8), the ID and the phoneme sequence of the new unregistered word read from the feature vector buffer 28. Furthermore, the clustering unit 29 produces a unique cluster number and registers it in the score sheet as the cluster number of the new unregistered word. The clustering unit 29 registers, in the score sheet, the ID of the new unregistered word as the representative member ID of the new unregistered word. Thus, in this case, the new unregistered word becomes the representative member of the new cluster.

[0134] In this case, because there is no already-stored unregistered word whose score with respect to the new unregistered word should be calculated, the score calculation is not performed.

[0135] After the completion of step S13, the process proceeds to step S22. In step S22, on the basis of the score sheet updated in step S13, the maintenance unit 31 updates the word dictionary stored in the dictionary memory 25, and the process is ended.

[0136] That is, in this case, because the new cluster has been produced, the maintenance unit 31 recognizes the newly produced cluster by detecting the cluster number described in the score sheet. The maintenance unit 31 adds an entry corresponding to that cluster to the word dictionary stored in the dictionary memory 25 and registers, as the phoneme sequence of that entry, the phoneme sequence of the representative member of the new cluster, that is, in this case, the phoneme sequence of the new unregistered word.

[0137] On the other hand, if it is determined in step S12 that there is a cluster which has already been obtained, that is, if the new unregistered word is not the first unregistered word and thus the score sheet (FIG. 8) includes an entry (row) of an already-stored unregistered word, the process proceeds to step S14. In step S14, the clustering unit 29 calculates the score of the new unregistered word with respect to each already-stored unregistered word and also calculates the score of each already-stored unregistered word with respect to the new unregistered word.

[0138] More specifically, for example, if there are N already-stored unregistered words respectively assigned IDs from 1 to N, and if the ID of the new unregistered word is given as N+1, the clustering unit 29 calculates the scores s(N+1, 1), s(N+1, 2), . . . , s(N+1, N) of the new unregistered word with respect to the N already-stored unregistered words, respectively, as shown in the row represented by a broken line in FIG. 8, and also calculates the scores s(1, N+1), s(2, N+1), . . . , s(N, N+1) of the N already-stored unregistered words, respectively, with respect to the new unregistered word. When the clustering unit 29 calculates the above-described scores, the feature vector sequences of the new unregistered word and the N already-stored unregistered words are needed, and they are obtained by referring to the feature vector buffer 28.

[0139] The clustering unit 29 adds the calculated scores to the score sheet (FIG. 8) together with the ID and the phoneme sequence of the new unregistered word. Thereafter, the process proceeds to step S15.

[0140] In step S15, the clustering unit 29 checks the score sheet (FIG. 8) to detect the cluster whose representative member gives the highest score s(N+1, i) (i=1, 2, . . . , N) of the new unregistered word with respect to that representative member. That is, the clustering unit 29 detects the already-stored unregistered words employed as representative members on the basis of the representative member IDs described in the score sheet and further detects, on the basis of the scores described in the score sheet, the already-stored unregistered word employed as a representative member which gives the highest score to the new unregistered word. The clustering unit 29 then detects the cluster having the cluster number corresponding to the detected already-stored unregistered word employed as the representative member of that cluster.
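Step S15 can be sketched as a search restricted to representative members. The sheet layout and the name `detect_cluster` are assumptions for illustration; note that in the toy data member 2 gives the new word its highest score overall, yet it is not a representative and is therefore not considered.

```python
def detect_cluster(sheet, new_id):
    """Return the cluster number whose representative gives the highest s(new_id, rep)."""
    rep_ids = {entry["rep"] for word_id, entry in sheet.items()
               if word_id != new_id and entry["rep"] is not None}
    best_rep = max(rep_ids, key=lambda rep: sheet[new_id]["scores"][rep])
    return sheet[best_rep]["cluster"]


# Toy score sheet: words 1 and 2 form cluster 1 (representative 1),
# word 3 forms cluster 2, and word 4 is the new unregistered word.
sheet = {
    1: {"cluster": 1, "rep": 1, "scores": {}},
    2: {"cluster": 1, "rep": 1, "scores": {}},
    3: {"cluster": 2, "rep": 3, "scores": {}},
    4: {"cluster": None, "rep": None, "scores": {1: 0.3, 2: 0.9, 3: 0.6}},
}
detected = detect_cluster(sheet, 4)
```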

[0141] Thereafter, the process proceeds to step S16. In step S16, the clustering unit 29 adds the new unregistered word to the cluster detected in step S15 (hereinafter, referred to simply as the detected cluster) as a member thereof. That is, the clustering unit 29 writes the cluster number of the representative member of the detected cluster, as the cluster number of the new unregistered word, into the score sheet.

[0142] In the next step S17, the clustering unit 29 divides the detected cluster into, for example, two clusters. Thereafter, the process proceeds to step S18. In step S18, the clustering unit 29 determines whether the detected cluster has been divided into two clusters in the cluster division process performed in step S17. If it is determined that the detected cluster has been divided into two clusters, then the process proceeds to step S19. In step S19, the clustering unit 29 determines the cluster-to-cluster distance between the two clusters obtained by dividing the detected cluster (hereinafter, such two clusters will be referred to as a first sub-cluster and a second sub-cluster).

[0143] The cluster-to-cluster distance between the first and second sub-clusters is defined, for example, as follows.

[0144] Herein, let k denote the ID of an arbitrary member (unregistered word) of the first and second sub-clusters, and let k1 and k2 denote the IDs of the representative members (unregistered words) of the first and second sub-clusters, respectively. The cluster-to-cluster distance between the first and second sub-clusters is given by D(k1, k2) obtained by the following equation:

$D(k1, k2) = \mathrm{maxval}_{k}\left\{ \mathrm{abs}\left( \log\left(s(k, k1)\right) - \log\left(s(k, k2)\right) \right) \right\}$  (2)

[0145] In equation (2), abs( ) represents the absolute value of the value enclosed in parentheses ( ), maxval_k{ } represents the maximum of the values obtained by changing k in the expression enclosed in braces { }, and log represents a natural or common logarithm.

[0146] Herein, if a member whose ID is i is denoted as member #i, the reciprocal of the score in equation (2), 1/s(k, k1), corresponds to the distance between member #k and the representative member #k1, and the reciprocal of the score, 1/s(k, k2), corresponds to the distance between member #k and the representative member #k2. Thus, according to equation (2), the cluster-to-cluster distance between the first and second sub-clusters is given by the maximum, over the members of the first and second sub-clusters, of the difference (on a logarithmic scale) between the distance of a member relative to the representative member #k1 of the first sub-cluster and the distance of that member relative to the representative member #k2 of the second sub-cluster.
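Equation (2) can be written out directly, as in the sketch below. The function name `cluster_distance` and the dictionary of scores keyed by (k, representative) are assumptions; the natural logarithm is used, and the scores are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def cluster_distance(members, k1, k2, s):
    """Equation (2): the largest abs(log s(k, k1) - log s(k, k2)) over all members k.

    members: IDs of all members of the first and second sub-clusters.
    k1, k2: IDs of the representative members of the two sub-clusters.
    s: mapping (k, representative) -> positive score.
    """
    return max(abs(math.log(s[(k, k1)]) - math.log(s[(k, k2)])) for k in members)


# Toy scores; member 3 differs most between the two representatives.
toy_scores = {(1, 1): 1.0, (1, 2): 0.5,
              (2, 1): 0.5, (2, 2): 1.0,
              (3, 1): 0.125, (3, 2): 0.5}
d = cluster_distance([1, 2, 3], 1, 2, toy_scores)
```

Because log s(k, k1) − log s(k, k2) equals log(1/s(k, k2)) − log(1/s(k, k1)), the same value is obtained when the reciprocals of the scores are treated as distances.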

[0147] Note that the definition of the cluster-to-cluster distance is not limited to that described above. For example, the sum of distances in the feature vector space may be determined by means of DP matching between the representative member of the first sub-cluster and the representative member of the second sub-cluster, and this sum of distances may be employed as the cluster-to-cluster distance.

[0148] After completion of step S19, the process proceeds to step S20. In step S20, the clustering unit 29 determines whether the cluster-to-cluster distance between the first and second sub-clusters is greater than (or equal to or greater than) a predetermined threshold ε.

[0149] If it is determined in step S20 that the cluster-to-cluster distance is greater than the predetermined threshold ε, that is, if it is determined that the plurality of unregistered words included as members in the detected cluster should be separated into two clusters from the point of view of their acoustic features, then the process proceeds to step S21. In step S21, the clustering unit 29 registers the first and second sub-clusters into the score sheet stored in the score sheet memory 30.

[0150] That is, the clustering unit 29 assigns unique cluster numbers to the first and second sub-clusters, respectively, and updates the score sheet such that, of the cluster numbers of the respective members of the detected cluster, the cluster numbers of those members which have been clustered into the first sub-cluster are replaced with the cluster number of the first sub-cluster, and the cluster numbers of those members which have been clustered into the second sub-cluster are replaced with the cluster number of the second sub-cluster.

[0151] The clustering unit 29 updates the score sheet such that the ID of the representative member of the first sub-cluster is employed as the representative member ID of the members clustered into the first sub-cluster and the ID of the representative member of the second sub-cluster is employed as the representative member ID of the members clustered into the second sub-cluster.

[0152] The cluster number of the detected cluster may be employed as that of either one of the first and second sub-clusters.

[0153] After the clustering unit 29 has registered the first and second sub-clusters into the score sheet in the above-described manner, the process proceeds from step S21 to step S22. In step S22, the maintenance unit 31 updates the word dictionary stored in the dictionary memory 25 on the basis of the score sheet. Thereafter, the process is ended.

[0154] That is, in this case, because the detected cluster has been divided into the first and second sub-clusters, the maintenance unit 31 first deletes the entry corresponding to the detected cluster from the word dictionary. The maintenance unit 31 then adds two entries corresponding to the first and second sub-clusters to the word dictionary and registers the phoneme sequence of the representative member of the first sub-cluster as the phoneme sequence of the entry corresponding to the first sub-cluster and the phoneme sequence of the representative member of the second sub-cluster as the phoneme sequence of the entry corresponding to the second sub-cluster.

[0155] On the other hand, in a case in which it is determined in step S18 that the detected cluster was not divided into two sub-clusters in the cluster division process in step S17, or in a case in which it is determined in step S20 that the cluster-to-cluster distance between the first and second sub-clusters is not greater than the predetermined threshold ε (that is, in a case in which the acoustic features of the unregistered words included as members in the detected cluster are determined to be not so different that the detected cluster should be divided into the first and second sub-clusters), the process proceeds to step S23. In step S23, the clustering unit 29 determines a new representative member of the detected cluster and updates the score sheet.

[0156] That is, the clustering unit 29 refers to the score sheet stored in the score sheet memory 30 to detect the scores s(k′, k) needed in the calculation of equation (1) for each member of the detected cluster to which the new unregistered word has been added as a member. The clustering unit 29 determines the ID of the member to be employed as the new representative member of the detected cluster in accordance with equation (1) using the detected scores s(k′, k). The clustering unit 29 then rewrites the score sheet (FIG. 8) such that the representative member ID of each member of the detected cluster is replaced with the ID of the new representative member of the detected cluster.

[0157] Thereafter, the process proceeds to step S22. In step S22, the maintenance unit 31 updates the word dictionary stored in the dictionary memory 25 on the basis of the score sheet, and the process is ended.

[0158] That is, in this case, the maintenance unit 31 refers to the score sheet to detect the new representative member of the detected cluster and further detects the phoneme sequence of that representative member. The maintenance unit 31 then updates the word dictionary by replacing the phoneme sequence of the entry corresponding to the detected cluster with the phoneme sequence of the new representative member of the detected cluster.
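The dictionary updates performed in step S22 (adding an entry for a new cluster, replacing the entry of a divided cluster with two entries, or replacing the phoneme sequence when the representative member changes) can all be viewed as re-deriving one entry per cluster from the score sheet. The sketch below takes that view; keying the entries by cluster number, the sheet layout, and the name `update_cluster_entries` are assumptions for illustration, not the stored format of the dictionary memory 25.

```python
def update_cluster_entries(cluster_entries, sheet):
    """Make cluster_entries hold one phoneme sequence per cluster in the score sheet.

    cluster_entries: mapping cluster number -> phoneme sequence of the
    dictionary entry for that cluster (hypothetical representation).
    """
    current = {entry["cluster"]: sheet[entry["rep"]]["phonemes"]
               for entry in sheet.values()}
    for number in list(cluster_entries):
        if number not in current:
            del cluster_entries[number]  # entry of a cluster that was divided
    cluster_entries.update(current)      # new clusters and new representatives


# Toy case: cluster 1 has just been divided into clusters 2 and 3.
sheet = {
    1: {"phonemes": "doko", "cluster": 2, "rep": 1},
    2: {"phonemes": "iro", "cluster": 3, "rep": 3},
    3: {"phonemes": "iiro", "cluster": 3, "rep": 3},
}
entries = {1: "doko"}
update_cluster_entries(entries, sheet)
```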

[0159] Referring now to the flow chart shown in FIG. 11, the cluster division process in step S17 shown in FIG. 10 is described in further detail below.

[0160] In the cluster division process, first in step S31, the clustering unit 29 selects an arbitrary combination of two members, which has not been selected yet, from the detected cluster to which the new unregistered word has been added as a member, and the clustering unit 29 employs those two members as provisional representative members. Hereinafter, those two provisional representative members will be referred to as a first provisional representative member and a second provisional representative member, respectively.

[0161] In the next step S32, the clustering unit 29 determines whether the members of the detected cluster can be divided into two sub-clusters such that the first and second provisional representative members become the representative members of the respective two sub-clusters.

[0162] In order to determine whether the first and second provisional representative members can be the representative members, it is necessary to calculate equation (1), wherein the scores s(k′, k) used in the calculation can be obtained by referring to the score sheet.

[0163] In a case in which it is determined in step S32 that the members of the detected cluster cannot be divided into two sub-clusters such that the first and second provisional representative members become the representative members of the respective two sub-clusters, the process jumps to step S34 without performing step S33.

[0164] However, if it is determined in step S32 that the members of the detected cluster can be divided into two sub-clusters such that the first and second provisional representative members become the representative members of the respective two sub-clusters, the process proceeds to step S33. In step S33, the clustering unit 29 divides the members of the detected cluster into two sub-clusters such that the first and second provisional representative members become the representative members of the respective two sub-clusters, and the clustering unit 29 employs the combination of the two sub-clusters as candidates for the final first and second sub-clusters into which the detected cluster should be divided (hereinafter, such candidates will be referred to as a combination of candidate sub-clusters). Thereafter, the process proceeds to step S34.

[0165] In step S34, the clustering unit 29 determines whether the detected cluster includes a combination of two members which has not yet been selected as a combination of first and second provisional representative members. If there is such a combination of two members, the process returns to step S31 to select a combination of two members, which has not yet been selected as a combination of first and second provisional representative members, from the members of the detected cluster. Thereafter, the process is performed in a similar manner as described above.

[0166] On the other hand, if it is determined in step S34 that there is no combination of two members of the detected cluster which has not yet been selected as a combination of first and second provisional representative members, the process proceeds to step S35. In step S35, the clustering unit 29 determines whether there is a combination of candidate sub-clusters.

[0167] If it is determined in step S35 that there is no combination of candidate sub-clusters, the flow exits from the process without performing step S36. In this case, it is determined in step S18 in FIG. 10 that the detected cluster cannot be divided.

[0168] On the other hand, if it is determined in step S35 that there is a combination of candidate sub-clusters, the process proceeds to step S36. In step S36, if there is a plurality of combinations of candidate sub-clusters, the clustering unit 29 determines the cluster-to-cluster distance between the two candidate sub-clusters for each combination of candidate sub-clusters. The clustering unit 29 then determines the combination of candidate sub-clusters having a minimum cluster-to-cluster distance and returns that combination of candidate sub-clusters as a result of division of the detected cluster, that is, as the first and second sub-clusters. In a case in which there is only one combination of candidate sub-clusters, those candidate sub-clusters are employed as the first and second sub-clusters.

[0169] In this case, it is determined in step S18 in FIG. 10 that the detected cluster has been divided.
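The loop over steps S31 to S36 can be sketched as follows. This is only an illustrative sketch, not the implementation of the clustering unit 29. It assumes that the representative member of a set is the member with the largest summed score to all members of that set (equation (1) is not reproduced in this section), that each member joins the provisional representative member giving it the higher score, and that the cluster-to-cluster distance is supplied by the caller. The names `representative`, `divide_cluster`, `score`, and `distance` are hypothetical.

```python
from itertools import combinations


def representative(members, score):
    # Assumption: the representative is the member whose summed score
    # s(k', k) over all members k of the set is largest.
    return max(members, key=lambda c: sum(score[c][k] for k in members))


def divide_cluster(members, score, distance):
    """Return (first_sub_cluster, second_sub_cluster), or None if indivisible."""
    candidates = []
    # S31 / S34: visit every not-yet-selected pair of members.
    for first, second in combinations(members, 2):
        # Assumed division rule: each member follows the provisional
        # representative that gives it the higher score.
        sub_a = [m for m in members if score[first][m] >= score[second][m]]
        sub_b = [m for m in members if score[first][m] < score[second][m]]
        # S32: the provisional representatives must really be the
        # representatives of the resulting sub-clusters.
        if (sub_a and sub_b
                and representative(sub_a, score) == first
                and representative(sub_b, score) == second):
            candidates.append((sub_a, sub_b))  # S33
    if not candidates:  # S35: no combination of candidate sub-clusters
        return None
    # S36: keep the combination with the minimum cluster-to-cluster distance.
    return min(candidates, key=lambda pair: distance(pair[0], pair[1]))
```

A return value of `None` corresponds to the case in which step S18 finds that the detected cluster cannot be divided.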

[0170] As described above, the clustering unit 29 detects, from the existing clusters each including an unregistered word, a cluster (detected cluster) to which the new unregistered word is to be added as a member thereof, and the clustering unit 29 divides the detected cluster on the basis of the members of the detected cluster so that the new unregistered word becomes a new member of the detected cluster, thereby easily clustering the unregistered words such that unregistered words having similar acoustic features are clustered into the same cluster.

[0171] Furthermore, the maintenance unit 31 updates the word dictionary on the basis of the result of the clustering, and thus it becomes possible to easily register an unregistered word into the word dictionary without causing a great increase in the size of the word dictionary.

[0172] Furthermore, even if the matching unit 23 detects a wrong voice period for an unregistered word, such an unregistered word is clustered, in the process of dividing the detected cluster, into a cluster different from the cluster into which an unregistered word whose voice period has been correctly detected is clustered. Although an entry corresponding to the cluster into which the unregistered word has been improperly clustered is registered in the dictionary, the phoneme sequence of that entry never makes a large contribution to the scores in voice recognition performed after that, because the phoneme sequence corresponds to the voice period which was not correctly detected. Therefore, even if a wrong detection occurs for a voice period of an unregistered word, the wrong detection does not have a significant influence on the subsequent voice recognition.

[0173] FIG. 12 shows a result of a simulation of clustering for an utterance including an unregistered word. In FIG. 12, each entry (row) represents one cluster. In FIG. 12, a phoneme sequence of a representative member (unregistered word) of each cluster is described on the left-hand side, and an utterance of the unregistered word included as a member in the cluster and the number of utterances are described on the right-hand side.

[0174] More specifically, in FIG. 12, for example, the entry in the first row describes a cluster including, as its only member, an utterance of an unregistered word pronounced “furo” (a Japanese word corresponding to the English word “bath”), and “doroa:” is obtained as the phoneme sequence of the representative member. As another example, the entry in the second row describes a cluster including, as members, three utterances of the unregistered word “furo” (bath), and “kuro” is obtained as the phoneme sequence of the representative member.

[0175] As still another example, the entry in the seventh row describes a cluster including, as members, four utterances of an unregistered word pronounced “hon” (a Japanese word corresponding to the English word “book”), and “NhoNde:su” is obtained as the phoneme sequence of the representative member. The entry in the eighth row describes a cluster including, as members, one utterance of an unregistered word pronounced “orenji” (a Japanese word corresponding to the English word “orange”) and nineteen utterances of the unregistered word “hon” (book), and “ohoN” is obtained as the phoneme sequence of the representative member. The other entries are described in a similar manner.

[0176] As can be seen from FIG. 12, good results are obtained in clustering for utterances of the same unregistered word.

[0177] In the entry in the eighth row in FIG. 12, one utterance of the unregistered word “orenji” (orange) and nineteen utterances of the unregistered word “hon” (book) are clustered into the same cluster. Judging from the utterances included as members in this cluster, this cluster should be a cluster of the unregistered word “hon” (book). However, the cluster also includes as a member an utterance of the unregistered word “orenji” (orange). If utterances of the unregistered word “hon” (book) are further input, then this cluster will be divided into a cluster including, as members, only utterances of the unregistered word “hon” (book) and a cluster including, as a member, only the utterance of the unregistered word “orenji” (orange).

[0178] Although the present invention has been described above with reference to embodiments in which the invention is applied to the entertainment robot (pet robot), the invention can also be applied to a wide variety of apparatuses or systems in which a voice recognition apparatus is used, such as a voice interactive system. Furthermore, the present invention can be applied not only to actual robots that act in the real world but also to virtual robots, such as those shown on a display such as a liquid crystal display.

[0179] In the first embodiment described above, a sequence of processing is performed by executing the program using the CPU 10A. Alternatively, the sequence of processing may also be performed by dedicated hardware.

[0180] The program may be stored, in advance, in the memory 10B (FIG. 2). Alternatively, the program may be stored (recorded) temporarily or permanently on a removable storage medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. A removable storage medium on which the program is stored may be provided as so-called packaged software, thereby allowing the program to be installed on the robot (memory 10B).

[0181] The program may also be installed into the memory 10B by downloading the program from a site via a digital broadcasting satellite and via a wireless or cable network such as a LAN (Local Area Network) or the Internet.

[0182] In this case, when the program is upgraded, the upgraded program may be easily installed in the memory 10B.

[0183] In the examples described above, the processing steps described in the program to be executed by the CPU 10A for performing various kinds of processing are not necessarily required to be executed in time sequence according to the order described in the flow chart. Instead, the processing steps may be performed in parallel or separately (by means of parallel processing or object processing).

[0184] The program may be executed either by a single CPU or by a plurality of CPUs in a distributed fashion.

[0185] The voice recognition unit 50A shown in FIG. 4 may be realized by means of dedicated hardware or by means of software. When the voice recognition unit 50A is realized by software, a software program is installed on a general-purpose computer or the like.

[0186] FIG. 13 illustrates an example of the configuration of a computer on which the program used to realize the voice recognition unit 50A is installed.

[0187] That is, another embodiment of a voice recognition apparatus 91according to the present invention is shown in FIG. 13.

[0188] As shown in FIG. 13, the program may be stored, in advance, on a hard disk 105 serving as a storage medium or in a ROM 103, both of which are disposed inside the computer.

[0189] Alternatively, the program may be stored (recorded) temporarily or permanently on a removable storage medium 111 such as a flexible disk, a CD-ROM, an MO disk, a DVD, a magnetic disk, or a semiconductor memory. Such a removable storage medium 111 may be provided in the form of so-called packaged software.

[0190] Instead of installing the program from the removable storage medium 111 onto the computer, the program may also be transferred to the computer from a download site via a digital broadcasting satellite by means of wireless transmission or via a network such as a LAN (Local Area Network) or the Internet by means of cable communication. In this case, the computer receives, using a communication unit 108, the program transmitted in the above-described manner and installs the received program on the hard disk 105 disposed in the computer.

[0191] The voice recognition apparatus 91 includes a CPU (Central Processing Unit) 102. The CPU 102 is connected to an input/output interface 110 via a bus 101 so that when a command issued by operating an input unit 107 including a keyboard, a mouse, a microphone, and an AD converter is input via the input/output interface 110, the CPU 102 executes the program stored in a ROM (Read Only Memory) 103 in response to the command. Alternatively, the CPU 102 may execute a program loaded in a RAM (Random Access Memory) 104, wherein the program may be loaded into the RAM 104 by transferring a program stored on the hard disk 105 into the RAM 104, or transferring a program which has been installed on the hard disk 105 after being received from a satellite or a network via the communication unit 108, or transferring a program which has been installed on the hard disk 105 after being read from a removable recording medium 111 loaded on a drive 109. By executing the program, the CPU 102 performs the process described above with reference to the flow chart or the process described above with reference to the block diagrams. The CPU 102 outputs the result of the process, as required, to an output unit 106 including an LCD (Liquid Crystal Display), a speaker, and a DA (Digital Analog) converter via the input/output interface 110. The result of the process may also be transmitted via the communication unit 108 or may be stored on the hard disk 105.

[0192] FIG. 14 shows an example of the configuration of the software program of the voice recognition apparatus 91. This software program includes a plurality of modules. Each module has its own independent algorithm, and each module executes a particular operation in accordance with its algorithm. Each module is stored in the RAM 13 and read and executed by the CPU 11.

[0193] The respective modules shown in FIG. 14 correspond to the blocks shown in FIG. 4. More specifically, an acoustic model buffer 133 corresponds to the acoustic model memory 24, a dictionary buffer 134 to the dictionary memory 25, a grammar buffer 135 to the grammar memory 26, a feature extraction module 131 to the feature extraction unit 22, a matching module 132 to the matching unit 23, an unregistered word period processing module 136 to the unregistered word period processing unit 27, a feature vector buffer 137 to the feature vector buffer 28, a clustering module 138 to the clustering unit 29, a score sheet buffer 139 to the score sheet memory 30, and a maintenance module 140 to the maintenance unit 31.

[0194] Note that in the present example, in the input unit 107 shown in FIG. 13, an input analog voice signal obtained via the microphone is supplied to the AD converter, which converts the input analog voice signal to digital voice data by means of an A/D (Analog/Digital) conversion including sampling and quantizing, and the resultant digital voice data is supplied to the feature extraction module 131.

[0195] In the present example, the feature vector buffer 137 stores, for example, as shown in FIG. 15, an ID, a phoneme sequence, a feature vector sequence, and a storage time of an unregistered word supplied from the unregistered word period processing module 136 such that they are related to each other. In other words, the feature vector buffer 137 stores a set of data indicating entries (in rows) corresponding to respective unregistered words.

[0196] In the example shown in FIG. 15, sequential numerals starting from 1 are assigned to respective unregistered words. Therefore, for example, when IDs, phoneme sequences, feature vector sequences, and storage times for N unregistered words have been stored in the feature vector buffer 137, if the matching module 132 detects a voice period and a phoneme sequence of another unregistered word, the unregistered word period processing module 136 assigns N+1 as an ID to the detected unregistered word, and the feature vector buffer 137 stores the ID (N+1), the phoneme sequence, the feature vector sequence, and the storage time of that unregistered word as represented by broken lines in FIG. 15.
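As a rough illustration of the entries of FIG. 15, the buffer can be pictured as a table keyed by ID in which each row also carries its storage time. The sketch below is an assumption made for explanation only; the class names `Entry` and `FeatureVectorBuffer`, the field names, and the use of a running counter for the next ID are hypothetical and do not come from the embodiment.

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    id: int
    phoneme_sequence: str
    feature_vectors: list   # utterance information (could also be a PCM signal)
    storage_time: float     # time at which the entry was stored


class FeatureVectorBuffer:
    def __init__(self):
        self.entries = {}    # ID -> Entry
        self._next_id = 1    # sequential numerals starting from 1

    def store(self, phoneme_sequence, feature_vectors, now=None):
        # After N entries have been stored, the next one receives the ID N+1.
        entry = Entry(self._next_id, phoneme_sequence, feature_vectors,
                      time.time() if now is None else now)
        self.entries[entry.id] = entry
        self._next_id += 1
        return entry
```

The optional `now` argument is there only so that the storage time can be fixed in an example or a test.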

[0197] The entries shown in FIG. 15 are similar to the respective entries shown in FIG. 7 except that each entry shown in FIG. 15 includes additional data indicating a storage time. Each storage time indicates a time at which an entry is stored (recorded) into the feature vector buffer 137. The method of using the storage time will be described later.

[0198] As will be described later, when the clustering module 138 performs clustering on a new unregistered word, the clustering module 138 refers to the “feature vectors” stored in the feature vector buffer 137. Hereinafter, such “voice information”, which is referred to when clustering is performed on an unregistered word, will be called “utterance information”.

[0199] That is, the “utterance information” is not limited to “feature vectors”; a “PCM (Pulse Code Modulation) signal”, such as the voice data supplied to the feature extraction module 131, may also be employed as utterance information. In this case, the feature vector buffer 137 stores the “PCM signal” instead of the “feature vector sequence”.

[0200] Thus, the voice recognition apparatus 91 formed of the modules described above can operate in a similar manner as the voice recognition unit 50A shown in FIG. 4, although the detailed structure and operation of the modules corresponding to the voice recognition unit 50A are not described herein.

[0201] The voice recognition unit 50A needs to store voice waveforms (for example, digital voice data) of clusters of unregistered words or feature vectors (for example, MFCCs (Mel Frequency Cepstrum Coefficients) obtained by performing MFCC analysis on digital voice data) as utterance information in a particular memory area or the feature vector buffer 28 serving as a memory so that the utterance information can be used to perform clustering on a newly input unregistered word.

[0202] That is, in the above-described process, when the voice recognition unit 50A detects a cluster to which an unregistered word is to be added as a new member, from existing clusters obtained by performing clustering on voices, the voice recognition unit 50A refers to the past utterance information stored in the particular storage area or the feature vector buffer 28 serving as the memory.

[0203] If the voice recognition unit 50A stores utterance information corresponding to all unregistered words one after another, then an increasingly large storage area or memory area is consumed as the number of input unregistered words (the number of acquired unregistered words) increases.

[0204] In the embodiment shown in FIG. 14, to avoid the above problem, there is additionally provided a feature vector deletion module 141 for, when a predetermined condition is satisfied, deleting particular utterance information and associated data from the feature vector buffer 137.

[0205] More specifically, for example, the feature vector deletion module 141 checks a score sheet, similar to that shown in FIG. 8, stored in the score sheet buffer 139 to determine whether the number of members belonging to a particular cluster has exceeded a predetermined first number. If it is determined that the number of such members is greater than the predetermined first number, the feature vector deletion module 141 deletes, from the feature vector buffer 137, utterance information and associated data of a second number of the members belonging to the particular cluster. Herein, data associated with a member includes an ID and a phoneme sequence of that member and also data of the member described on the score sheet.

[0206] Thus, the feature vector deletion module 141 prevents clusters from becoming greater than a predetermined size, thereby not only suppressing consumption of memory (such as the RAM 103) but also preventing a reduction in operation speed of the voice recognition apparatus 91, that is, preventing degradation in performance of the voice recognition apparatus 91.

[0207] Note that the first and second numbers described above are selected such that the first number is equal to or greater than the second number. The second number of members to be deleted may be selected, for example, in the order of storage time from oldest to newest according to the storage time data shown in FIG. 15.
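The size-based deletion rule can be sketched as follows. This is a minimal sketch under stated assumptions, not the feature vector deletion module 141 itself: the function name `members_to_delete` and the mapping from member ID to storage time are hypothetical, and oldest-first selection is only the example ordering mentioned above.

```python
def members_to_delete(storage_times, first_number, second_number):
    """Pick the members of one cluster whose data should be deleted.

    storage_times maps member ID -> storage time for a single cluster.
    """
    if second_number > first_number:
        raise ValueError("the first number must be >= the second number")
    # Nothing is deleted until the cluster exceeds the first number.
    if len(storage_times) <= first_number:
        return []
    # Delete the second number of members, oldest storage time first.
    oldest_first = sorted(storage_times, key=storage_times.get)
    return oldest_first[:second_number]
```

For a cluster with storage times `{1: 10.0, 2: 5.0, 3: 20.0, 4: 1.0}`, a first number of 3 and a second number of 2, the members with IDs 4 and 2 are selected.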

[0208] Furthermore, if the feature vector deletion module 141 determines that data supplied from the not-referred-to time calculation module 142 indicates that a particular cluster has not been referred to at all over a period of time equal to or longer than a predetermined length, the feature vector deletion module 141 deletes utterance information and associated data of members of that particular cluster from the feature vector buffer 137.

[0209] More specifically, for example, the not-referred-to time calculation module 142 checks the feature vector buffer 137 to detect the latest one of the times (storage times shown in FIG. 15) at which utterance information of the respective members belonging to a particular cluster was stored into the feature vector buffer 137 (that is, the not-referred-to time calculation module 142 detects the time at which an entry of an unregistered word clustered last into the particular cluster was stored into the feature vector buffer 137), and the not-referred-to time calculation module 142 employs the detected latest time as the last reference time of that particular cluster.

[0210] The not-referred-to time calculation module 142 then subtracts the detected last reference time from the current time to determine the not-referred-to time during which the particular cluster has not been referred to and supplies the not-referred-to time to the feature vector deletion module 141.
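The calculation described in the two preceding paragraphs amounts to a maximum followed by a subtraction, as the following sketch shows. The function name `not_referred_to_time` and the representation of times as plain numbers are assumptions made for illustration.

```python
def not_referred_to_time(storage_times, current_time):
    """Time elapsed since a cluster was last referred to.

    storage_times holds the storage times of all members of one cluster.
    """
    # The latest storage time is employed as the last reference time.
    last_reference_time = max(storage_times)
    return current_time - last_reference_time
```

For example, with storage times 100, 250, and 180 and a current time of 400, the last reference time is 250 and the not-referred-to time is 150.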

[0211] Although in the present embodiment, the not-referred-to time calculation module 142 calculates the not-referred-to time for all clusters at predetermined time intervals, there is no limitation on the number of clusters the not-referred-to time of which is calculated. For example, the not-referred-to time calculation module 142 may calculate the not-referred-to time only for clusters specified by a user.

[0212] Furthermore, the method of the calculation performed by the not-referred-to time calculation module 142 is not limited to that described above. For example, although in the above example the not-referred-to time is calculated on the basis of the storage times stored in the feature vector buffer 137, the storage times do not necessarily need to be stored in the feature vector buffer 137; instead, the not-referred-to time calculation module 142 may directly detect and store the last reference time of a particular cluster and may calculate the not-referred-to time on the basis of the stored last reference time.

[0213] In the above example, on the basis of the not-referred-to time supplied from the not-referred-to time calculation module 142, the feature vector deletion module 141 deletes, from the feature vector buffer 137, utterance information and associated data of all members belonging to a cluster to which no new member has been registered for a long time. Alternatively, instead of deleting all members of such a cluster, utterance information and associated data of only some members of the cluster may be deleted.

[0214] Furthermore, although in the example described above, the storage time of a member (unregistered word) registered last in a cluster is employed as the last reference time of that cluster, the last reference time of a cluster may be determined in a different manner. For example, a time at which a cluster is detected in step S15 in FIG. 10, or a time at which sub-clusters are registered in step S21, or a time at which a cluster is referred to in some process may be employed as the last reference time.

[0215] When the feature vector deletion module 141 receives, via the input unit 107 (for example, a keyboard), a delete command (trigger signal) indicating that a particular cluster should be deleted, the feature vector deletion module 141 may delete utterance information and associated data of part or all of the members belonging to that particular cluster from the feature vector buffer 137.

[0216] When the voice recognition apparatus 91 is disposed on the pet robot shown in FIG. 1, if the feature vector deletion module 141 deletes a particular feature vector sequence in response not to an internal state of the voice recognition apparatus 91 but to an external stimulus, amnesia due to a strong stimulus can be realized in the robot.

[0217] Furthermore, for example, if the level of emotion (emotion level) supplied from the emotion control module 143 is greater than a predetermined value (level), the feature vector deletion module 141 may delete utterance information and associated data of part or all of the members belonging to a particular cluster from the feature vector buffer 137.

[0218] In a case in which the voice recognition apparatus 91 is provided in the robot shown in FIG. 1, the emotion control module 143 can be realized by the model memory 51 shown in FIG. 3. That is, in this case, the model memory 51 supplies information indicating an emotion level as state information to the feature vector deletion module 141, wherein the emotion level indicates the state of emotion, instinct, or growth indicated by a value of the emotion model, the instinct model, or the growth model.

[0219] If the feature vector deletion module 141 deletes particular utterance information stored in the feature vector buffer 137 on the basis of the emotion level (parameter value of emotion (value of a model)) supplied from the emotion control module 143, it is possible to realize a loss of memory in the robot shown in FIG. 1 in response to an occurrence of strong anger or the like in the robot (in response to an increase in the parameter associated with “anger” beyond a predetermined value).

[0220] When the data supplied from the used-memory-area calculation module 144 indicates that the total amount of used memory space (for example, the memory space of the RAM 103 or the like, shown in FIG. 1, including the feature vector buffer 137 and the score sheet buffer 139) has exceeded a predetermined value, the feature vector deletion module 141 may delete utterance information and associated data of part or all of the members belonging to a particular cluster from the feature vector buffer 137.

[0221] More specifically, the used-memory-area calculation module 144 continuously calculates the total amount of used memory space (total memory consumption amount) and supplies data indicating the total amount of used memory space to the feature vector deletion module 141 at predetermined time intervals.

[0222] As described above, the feature vector deletion module 141 continuously monitors the amount of memory consumption (of the RAM 103 or the like), and if the amount of memory consumption has exceeded the predetermined value, the feature vector deletion module 141 reduces the amount of memory consumption by deleting utterance information and associated data of members belonging to a cluster stored in the feature vector buffer 137, thereby not only suppressing consumption of memory (such as the RAM 103) but also preventing a reduction in operation speed of the voice recognition apparatus 91, that is, preventing degradation in performance of the voice recognition apparatus 91.

[0223] In the example described above, the feature vector deletion module 141 determines whether the value of a parameter has exceeded a predetermined value, wherein the parameter may be the number of members of a cluster (the number of entries associated with the respective members of the same cluster stored in the feature vector buffer 137), the not-referred-to time indicated by the not-referred-to time calculation module 142, the emotion level indicated by the emotion control module 143, or the memory consumption amount indicated by the used-memory-area calculation module 144. If the feature vector deletion module 141 determines that the value of the parameter has exceeded the predetermined value, it determines that the predetermined condition is satisfied and deletes part or all of the members of the cluster. However, the method of deleting members (utterance information and associated data of members) is not limited to such a method.

[0224] For example, instead of making the judgment on the parameter value, the feature vector deletion module 141 may determine that the particular condition is satisfied when a trigger signal (such as a delete command supplied via the input unit 107) is input, and may delete particular utterance information.

[0225] In this case, the emotion control module 143, the not-referred-to time calculation module 142, and the used-memory-area calculation module 144 may make the above-described judgment on the parameter value, and if it is determined in the judgment process that the value of the parameter (the emotion level, the not-referred-to time, or the total amount of used memory space) associated with one of these modules is greater than the predetermined threshold, a trigger signal may be supplied to the feature vector deletion module 141.
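The threshold judgment that produces a trigger signal can be pictured with the short sketch below. It is an illustration only; the function name `deletion_triggers`, the parameter names used as dictionary keys, and the representation of a trigger signal as a list of names are all hypothetical.

```python
def deletion_triggers(parameters, thresholds):
    """Return the names of the parameters whose value exceeds its threshold.

    A non-empty result stands for a trigger signal supplied to the
    deletion step; the keys are illustrative names, not reference numerals.
    """
    return [name for name, value in parameters.items()
            if name in thresholds and value > thresholds[name]]
```

With parameters `{"emotion_level": 80, "not_referred_to_time": 10, "used_memory": 500}` and thresholds `{"emotion_level": 70, "not_referred_to_time": 3600, "used_memory": 1000}`, only the emotion level raises a trigger.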

[0226] The trigger signal supplied to the feature vector deletion module 141 is not limited to that described above, but another trigger signal such as that generated by a user when an arbitrary condition is satisfied may also be employed.

[0227] As described above, when the feature vector deletion module 141 determines that the particular condition is satisfied, the feature vector deletion module 141 deletes particular utterance information of members stored in the feature vector buffer 137. The utterance information deleted in the above process may be arbitrarily selected (set), and the number of pieces of utterance information deleted may also be arbitrarily selected (set). For example, a user or a manufacturer may set the conditions associated with deletion of utterance information.

[0228] In order to maintain the voice recognition accuracy of the voice recognition apparatus 91 at a high level without encountering a reduction in performance thereof, it is desirable to preferentially delete the members described below.

[0229] That is, in a case in which some of the members of a cluster are deleted, if a representative member of that cluster or a member having a rather small distance relative to the representative member (that is, a member having a large score associated with the representative member) is deleted, a great change will occur in the structure itself of that cluster. Therefore, it is desirable to preferentially delete members other than those described above.

[0230] On the other hand, members of a cluster including a small number of members, members having a large distance relative to a representative member, and members of a cluster to which a new member has not been added for a long time can be regarded as not having a large influence on the voice recognition accuracy, and thus it is desirable to preferentially delete such members.

[0231] When the feature vector deletion module 141 deletes utterance information and associated data of members stored in the feature vector buffer 137, the score sheet stored in the score sheet buffer 139 still includes data associated with the deleted members.

[0232] Therefore, if the feature vector deletion module 141 deletes utterance information and associated data stored in the feature vector buffer 137, the feature vector deletion module 141 also deletes, from the score sheet, data associated with the members deleted from the feature vector buffer 137.

[0233] For example, if the feature vector deletion module 141 deletes data (ID, phoneme sequence, feature vector sequence (utterance information), and storage time) of an entry (in a row) having an ID of 3 shown in FIG. 15, then the feature vector deletion module 141 further deletes, from the score sheet shown in FIG. 8, data (ID, phoneme sequence, cluster number, representative member ID, and scores s(3, i) (i=1, . . . , N+1)) of an entry (in a row) having an ID of 3 and also deletes the scores s(j, 3) (j=1, . . . , N+1) that the members having the other IDs hold for the member having the ID of 3.
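Removing a member from the score sheet means removing both its row, which holds s(3, i), and its column, which holds s(j, 3). The sketch below illustrates this under the assumption that the score sheet is held as a dictionary of rows keyed by ID, each row carrying a `scores` dictionary; this layout and the function name `delete_from_score_sheet` are hypothetical and are not taken from FIG. 8.

```python
def delete_from_score_sheet(score_sheet, member_id):
    """Remove one member's row and column from the score sheet in place."""
    # Row of the deleted member: ID, phoneme sequence, cluster number,
    # representative member ID, and the scores s(member_id, i).
    score_sheet.pop(member_id, None)
    # Column of the deleted member: the scores s(j, member_id) held in
    # the rows of all remaining members.
    for row in score_sheet.values():
        row["scores"].pop(member_id, None)
```

After the call, no remaining row refers to the deleted ID, so later lookups of scores cannot reach stale data.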

[0234] In this case, the clustering module 138 reselects (redetermines) a representative member of the cluster to which the deleted member belonged. In the specific example described above, the representative member of the cluster (with a cluster number of 1) to which the member having the ID of 3 belonged as shown in FIG. 8 is reselected (redetermined). If the reselected representative member is different from the previous one (that is, if a member having an ID other than 1 is reselected as a representative member), there is a possibility that the structures of all clusters change. Thus, reclustering is performed on all unregistered words, regardless of ID.

[0235] The method of reclustering is not limited to a specific one, and, for example, a k-means method may be employed.

[0236] In this case, the clustering module 138 performs processes (1) to (3) described below, wherein it is assumed that N unregistered words are registered in the score sheet stored in the score sheet buffer 139 and these unregistered words are clustered into k clusters.

[0237] (1) Of the N unregistered words, k arbitrary unregistered words are selected such that each of the k unregistered words becomes the initial cluster center of one of the k clusters, and each initial cluster center is employed as a provisional representative member.

[0238] (2) The scores of all data (N unregistered words) associated with the k representative members are recalculated, and each of the N unregistered words is registered as a member of a cluster including a representative member giving a highest recalculated score to the unregistered word.

[0239] (3) A representative member is selected for each of the k clusters into which members have been newly registered.

[0240] In the process (2) described above, the scores can be determined by referring to the score sheet without actually calculating the scores. However, the clustering module 138 may actually calculate the scores in the process (2). In this case, utterance information of the N unregistered words is needed, wherein the utterance information can be obtained by referring to the feature vector buffer 137.
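Processes (1) to (3) can be sketched as follows, with the scores looked up in a score sheet rather than recalculated. This is a sketch under stated assumptions: the representative of a cluster is assumed to be the member with the largest summed score to the members of that cluster, and the repetition of processes (2) and (3) until the representatives stop changing is the usual k-means iteration, which the text above does not spell out. The names `representative`, `recluster`, and `max_iter` are hypothetical.

```python
def representative(members, score):
    # Assumption: the member with the largest summed score to all
    # members of the cluster is its representative.
    return max(members, key=lambda c: sum(score[c][k] for k in members))


def recluster(ids, score, initial_reps, max_iter=100):
    """Return {representative ID: member IDs} for the reclustered words."""
    reps = list(initial_reps)          # (1) provisional representatives
    clusters = {}
    for _ in range(max_iter):
        # (2) register each word with the representative that gives it
        # the highest score, read from the score sheet.
        assigned = {r: [] for r in reps}
        for m in ids:
            assigned[max(reps, key=lambda r: score[r][m])].append(m)
        # (3) select a representative for each non-empty cluster.
        groups = [ms for ms in assigned.values() if ms]
        new_reps = [representative(ms, score) for ms in groups]
        clusters = dict(zip(new_reps, groups))
        if set(new_reps) == set(reps):  # assumed stopping rule
            break
        reps = new_reps
    return clusters
```

Because only lookups in `score` are used, the utterance information itself is not needed in this variant, which matches the remark that the scores can be taken from the score sheet.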

[0241] When the clustering module 138 actually calculates the scores, if PCM signals (voice data) rather than feature vector sequences are stored as utterance information in the feature vector buffer 137, the clustering module 138 calculates the scores on the basis of the PCM signals.
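The processes (1) to (3) can be illustrated with a minimal Python sketch. It is not the implementation of the clustering module 138. It assumes that the score sheet is held as a nested mapping scores[i][j] = s(i, j), that the representative member is the member having the greatest sum of scores with respect to the other members (as in claim 5), and that processes (2) and (3) are repeated until the representative members stop changing, which is the usual k-means stopping rule and is not stated in the text. The names select_representative, recluster, and max_iter are hypothetical.

```python
def select_representative(members, scores):
    """Pick the member with the greatest sum of scores to the other members."""
    def total(m):
        return sum(scores[m][o] for o in members if o != m)
    return max(members, key=total)


def recluster(scores, k, max_iter=100):
    """Cluster the IDs in `scores` into k clusters; returns {representative: members}."""
    ids = sorted(scores)
    reps = ids[:k]                      # process (1): arbitrary initial centers
    clusters = {}
    for _ in range(max_iter):
        # process (2): look up s(word, representative) in the score sheet
        groups = {r: [] for r in reps}
        for w in ids:
            best = max(reps, key=lambda r: scores[w][r])
            groups[best].append(w)
        # process (3): reselect a representative for every non-empty cluster
        clusters = {select_representative(m, scores): m
                    for m in groups.values() if m}
        if set(clusters) == set(reps):  # representatives stable: stop (assumed rule)
            break
        reps = sorted(clusters)
    return clusters
```

Because every score is read from the mapping, no utterance information is needed in this variant, which corresponds to the case in which the score sheet is referred to instead of recalculating the scores.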

[0242] If, as a result of the reclustering using the k-means method, the structure of a cluster other than the cluster to which a deleted unregistered word previously belonged has been changed, the deletion of that unregistered word is regarded as having a large influence, and the clustering module 138 and the feature vector deletion module 141 cancel the deletion of that unregistered word and also cancel all processes (updating of the score sheet and the reclustering) arising from the deletion, so as to return the associated data to the state in which it was before the deletion was performed (by performing the undo process).

[0243] Referring now to a flow chart shown in FIG. 16, the voice recognition process performed by the voice recognition apparatus 91 shown in FIG. 14 is described.

[0244] In the example described below, it is assumed that the data shown in FIG. 15 is stored in the feature vector buffer 137, and the score sheet shown in FIG. 8 is stored in the score sheet buffer 139. Furthermore, it is assumed that utterance information is represented in the form of a feature vector sequence.

[0245] In step S101, the feature vector deletion module 141 determines whether a command for deleting an unregistered word has been issued.

[0246] In the present example, the feature vector deletion module 141 determines that the command for deleting an unregistered word has been issued when one of conditions (1) to (5) described below is satisfied:

[0247] (1) if the number of members belonging to a particular cluster, of the clusters registered in the score sheet stored in the score sheet buffer 139, becomes greater than a predetermined number;

[0248] (2) if the data supplied from the not-referred-to time calculation module 142 indicates that the not-referred-to time of a particular cluster becomes longer than a predetermined time;

[0249] (3) if a delete command (trigger signal) is supplied via the input unit 107;

[0250] (4) if the value of a parameter of emotion (emotion level) supplied from the emotion control module 143 is greater than a predetermined value (level); or

[0251] (5) if the data supplied from the used-memory-area calculation module 144 indicates that the total amount of used memory space (of the RAM 103 or the like) is greater than a predetermined value.
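The determination in step S101 amounts to a logical OR over conditions (1) to (5). The following Python sketch illustrates this under stated assumptions: the data classes DeletionLimits and DeletionStatus, their field names, and the function deletion_commanded are hypothetical and merely stand in for the values supplied by the score sheet buffer 139, the not-referred-to time calculation module 142, the input unit 107, the emotion control module 143, and the used-memory-area calculation module 144.

```python
from dataclasses import dataclass


@dataclass
class DeletionLimits:
    """Predetermined values (hypothetical names and units)."""
    max_members: int
    max_unreferenced_time: float
    max_emotion_level: float
    max_used_memory: int


@dataclass
class DeletionStatus:
    """Current readings gathered from the other modules (hypothetical)."""
    cluster_sizes: dict        # cluster number -> number of members
    unreferenced_times: dict   # cluster number -> not-referred-to time
    trigger_signal: bool       # delete command from the input unit
    emotion_level: float
    used_memory: int


def deletion_commanded(status, limits):
    """Return True when any one of conditions (1) to (5) holds."""
    return (
        any(n > limits.max_members for n in status.cluster_sizes.values())       # (1)
        or any(t > limits.max_unreferenced_time
               for t in status.unreferenced_times.values())                      # (2)
        or status.trigger_signal                                                 # (3)
        or status.emotion_level > limits.max_emotion_level                       # (4)
        or status.used_memory > limits.max_used_memory                           # (5)
    )
```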

[0252] If the feature vector deletion module 141 determines in step S101 that the command for deleting an unregistered word has been issued, then the feature vector deletion module 141 executes, in the next step S102, an “unregistered word deletion process” on the unregistered word commanded to be deleted (hereinafter, referred to as the unregistered word to be deleted). Thereafter, the process returns to step S101 to again determine whether an unregistered word is commanded to be deleted.

[0253] The details of the “unregistered word deletion process” are shown in FIG. 17. Referring to FIG. 17, the “unregistered word deletion process” is described below.

[0254] First, in step S121, the feature vector deletion module 141 deletes, from the data stored in the feature vector buffer 137, the data corresponding to the unregistered word to be deleted.

[0255] For example, in a case in which an unregistered word having an ID of 3 shown in FIG. 15 is to be deleted, of the data shown in FIG. 15, data (ID, phoneme sequence, feature vector (utterance information), and storage time) of the entry (row) having the ID of 3 is deleted.

[0256] In step S122, the feature vector deletion module 141 corrects the score sheet stored in the score sheet buffer 139.

[0257] For example, if the data of the entry of the ID of 3 has been deleted in step S121 in the above-described manner, then, in step S122, of the data stored in the score sheet shown in FIG. 8, data (ID, phoneme sequence, cluster number, representative member ID, and scores s(3, i) (i=1, . . . , N+1)) of the entry (row) of the ID of 3 is deleted, and scores s(j, 3) (j=1, . . . , N+1) of unregistered words having IDs other than 3 associated with the deleted unregistered word having the ID of 3 are deleted.
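The correction in step S122 removes one row and one column of scores. A minimal Python sketch follows, assuming (hypothetically) that the score sheet is a dictionary keyed by ID whose rows hold the phoneme sequence, cluster number, representative member ID, and a "scores" mapping from other IDs to s(ID, other). The layout and the function name remove_from_score_sheet are illustrative only, not the format of FIG. 8.

```python
def remove_from_score_sheet(sheet, word_id):
    """Delete the row of `word_id` and every score s(j, word_id) held by other rows."""
    removed = sheet.pop(word_id)           # row: ID, phonemes, cluster, rep, s(word_id, i)
    for row in sheet.values():
        row["scores"].pop(word_id, None)   # column: s(j, word_id) for every remaining j
    return removed                         # returned so that a caller could undo the deletion
```

Returning the removed row is a design choice made here so that a later undo step can restore it; the source does not specify how the previous state is retained.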

[0258] In step S123, the clustering module 138 reselects (redetermines) a representative member for the cluster to which the deleted unregistered word previously belonged.

[0259] In this specific example, the unregistered word having the ID of 3 was deleted, and thus a representative member for the cluster having a cluster number of 1 (to which the unregistered word having the ID of 3 previously belonged) described in the score sheet shown in FIG. 8 is reselected in the above-described manner.

[0260] In step S124, the clustering module 138 determines whether the reselected representative member is different from the previous representative member (that is, whether the representative member reselected in step S123 is different from the immediately previous representative member). If no change is detected in the representative member, the flow returns. That is, step S102 in FIG. 16 is ended and the process returns to step S101 to repeat the process from step S101.

[0261] For example, in a case in which a member having an ID of 1 was reselected in step S123 as the representative member, it is determined that no change has occurred in the representative member. On the other hand, in a case in which a member having an ID other than 1 was reselected as the representative member, it is determined that the representative member has been changed.

[0262] In a case in which the clustering module 138 has determined in step S124 that the representative member has been changed, the process proceeds to step S125, in which the clustering module 138 performs reclustering on all unregistered words (in the present example, all unregistered words registered in the score sheet shown in FIG. 8 except for the unregistered word having the ID of 3). That is, the clustering module 138 performs reclustering on all unregistered words by means of, for example, the k-means method described earlier.

[0263] In step S126, the clustering module 138 determines whether a change has occurred in the structure of a cluster other than the cluster to which the deleted unregistered word previously belonged (more specifically, for example, the clustering module 138 determines whether a change in terms of members belonging to a cluster has occurred or whether a representative member of a cluster has been changed to another member). If it is determined that no change has occurred in cluster structure, the process proceeds to step S128. In step S128, the maintenance module 140 updates the word dictionary stored in the dictionary buffer 134 on the basis of the score sheet updated (corrected) in step S122. Thereafter, the flow returns.

[0264] That is, in the present case, when a new representative member for the cluster to which the deleted unregistered word previously belonged is reselected (in step S123), the new representative member is different from the previous representative member (step S124), and thus the maintenance module 140 refers to the score sheet to detect clusters whose representative members have been redetermined. The maintenance module 140 registers, in the word dictionary stored in the dictionary buffer 134, the phoneme sequences of the new representative members as the phoneme sequences of the entries corresponding to the clusters whose representative members have been redetermined.

[0265] On the other hand, if the clustering module 138 determines in step S126 that a change has occurred in cluster structure, then, in step S127, the clustering module 138 and the feature vector deletion module 141 return the contents of the feature vector buffer 137 and the score sheet buffer 139 to the states in which they were before the deletion was performed (in step S121). That is, the clustering module 138 and the feature vector deletion module 141 perform the undo process until the associated data is returned to the state in which it was before the unregistered word was deleted. Thereafter, the flow returns.

[0266] The process (undo process) in steps S126 and S127 may be omitted. That is, in the voice recognition apparatus 91, a change of a cluster may be allowed, and the undo process may not be performed.

[0267] The voice recognition apparatus 91 may be constructed such that whether steps S126 and S127 should be performed or not can be selected (by a user or the like) from the outside of the voice recognition apparatus 91.
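The flow of steps S121 to S128, including the optional undo, can be sketched in Python as follows. This is an illustration under several assumptions, not the apparatus itself: the state is a hypothetical dictionary with "features" (ID to utterance information) and "clusters" (cluster number to representative ID and member list); the score sheet correction of step S122 is reduced to removing the ID from its cluster; select_rep and recluster are caller-supplied stand-ins for the clustering module 138; recluster is assumed to keep cluster numbers stable so that structures can be compared; and the dictionary update of step S128 is represented only by a flag. The case in which all members of a cluster are deleted is not handled.

```python
import copy


def delete_word(state, word_id, select_rep, recluster, allow_undo=True):
    """Return 'unchanged', 'updated', or 'undone' (hypothetical result labels)."""
    snapshot = copy.deepcopy(state)                     # kept for the undo in S127
    clusters = state["clusters"]
    number = next(n for n, c in clusters.items() if word_id in c["members"])

    del state["features"][word_id]                      # S121: drop utterance information
    clusters[number]["members"].remove(word_id)         # S122: simplified sheet correction

    old_rep = clusters[number]["rep"]
    new_rep = select_rep(clusters[number]["members"])   # S123: reselect representative
    clusters[number]["rep"] = new_rep
    if new_rep == old_rep:                              # S124: no change, flow returns
        return "unchanged"

    reclustered = recluster(state)                      # S125: recluster all words
    if allow_undo:
        before = {n: c for n, c in snapshot["clusters"].items() if n != number}
        after = {n: c for n, c in reclustered.items() if n != number}
        if after != before:                             # S126: another cluster changed
            state.clear()
            state.update(snapshot)                      # S127: undo everything
            return "undone"
    state["clusters"] = reclustered
    state["dictionary_dirty"] = True                    # S128: stand-in for dictionary update
    return "updated"
```

Passing allow_undo=False corresponds to the variant in which steps S126 and S127 are not performed.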

[0268] In a case in which all members of a cluster are determined as unregistered words to be deleted in the unregistered word deletion process in FIG. 17, and thus all members are deleted, the unregistered word deletion process is equivalent to deleting the cluster itself to which the members belong. In this case, it is not necessary (and indeed impossible) to determine a new representative member for that cluster. Therefore, in this case, after completion of step S122, steps S123 and S124 are skipped and the flow proceeds to step S125 and then to step S126. If it is determined in step S126 that no change in cluster structure has occurred, the process proceeds to step S128. In step S128, the maintenance module 140 updates the word dictionary stored in the dictionary buffer 134 on the basis of the score sheet updated (corrected) in step S122. Thereafter, the flow returns.

[0269] More specifically, in this case, all members of a certain cluster are deleted, and thus the cluster itself is deleted. The maintenance module 140 refers to the score sheet to detect the deleted cluster. The maintenance module 140 then deletes an entry corresponding to the deleted cluster from the word dictionary stored in the dictionary buffer 134.

[0270] By deleting an entry corresponding to a certain cluster from the word dictionary stored in the dictionary buffer 134 in the above-described manner, amnesia or a loss in memory is realized.

[0271] Referring again to FIG. 16, if it is determined in step S101 that no unregistered word is commanded to be deleted, then, in step S103, the feature extraction module 131 determines whether a voice has been input.

[0272] If it is determined in step S103 that no voice has been input, the flow returns to step S101 to repeat the process from step S101.

[0273] That is, the feature vector deletion module 141 always checks whether deletion of an unregistered word (utterance information of an unregistered word stored in the feature vector buffer 137) has been commanded, and the feature extraction module 131 always checks, independently of the feature vector deletion module 141, whether a voice has been input.

[0274] Herein, if a user speaks, the uttered voice is passed through the microphone and the AD converter in the input unit 107, thereby obtaining digital voice data. The resultant digital voice data is supplied to the feature extraction module 131.

[0275] In step S103, the feature extraction module 131 determines whether a voice has been input. If a voice is input (if it is determined that a voice has been input), then, in step S104, acoustic analysis is performed on the voice data on a frame-by-frame basis, thereby extracting feature vectors. The sequence of the extracted feature vectors is supplied to the matching module 132 and the unregistered word period processing module 136.

[0276] Steps S104 to S108 are similar to steps S1 to S5 described above with reference to FIG. 9, and thus steps S104 to S108 are not described herein.

[0277] As described above, if the feature vector deletion module 141 determines that the specific condition is satisfied, the feature vector deletion module 141 deletes, from the data stored in the feature vector buffer 137, utterance information (a feature vector sequence in the example shown in FIG. 15) of a member regarded as having little influence on clustering, together with associated data (ID, phoneme sequence, and storage time in the example shown in FIG. 15), thereby suppressing the consumption of the storage area without causing degradation in the function of automatically acquiring unregistered words.

[0278] Furthermore, the feature vector deletion module 141 corrects data related to the member (deletes unnecessary data) described in the score sheet stored in the score sheet buffer 139, thereby further suppressing the consumption of the storage area.

[0279] Furthermore, the maintenance module 140 updates the word dictionary on the basis of the corrected score sheet. This makes it possible to realize amnesia or a loss in memory in the robot, thereby making the robot capable of providing greater entertainment to users.

[0280] In the embodiments described above, the steps described in the program stored in the storage medium may be performed either in time sequence in accordance with the order described in the program or in a parallel or separate fashion.

[0281] Furthermore, there are no particular limitations on the details of the respective modules shown in FIG. 14, as long as the modules provide the necessary functions. For example, the modules may be constructed by means of hardware. In this case, a manufacturer may connect the respective modules in the manner shown in FIG. 14. In other words, instead of the voice recognition unit 50A shown in FIG. 3, the apparatus realized by means of hardware constructed in the manner shown in FIG. 14 may be used as the voice recognition unit.

[0282] In the embodiments described above, voice recognition is performed by means of the HMM method. However, in the present invention, voice recognition may also be performed by means of another method such as a DP matching method. In the case in which voice recognition is performed by means of the DP matching method, the score employed in the above-described embodiments corresponds to the reciprocal of the distance between an input voice and a standard pattern.
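The relation between a DP matching distance and a score can be illustrated with a short Python sketch. It uses a textbook dynamic time warping recurrence over sequences of scalar features with an absolute-difference local cost; these choices, the function names dtw_distance and dp_score, and the handling of a zero distance are assumptions made for illustration and are not taken from the embodiments.

```python
def dtw_distance(a, b):
    """Dynamic-programming (DTW) distance between two scalar sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local distance between frames
            d[i][j] = cost + min(d[i - 1][j],        # stretch the standard pattern
                                 d[i][j - 1],        # stretch the input
                                 d[i - 1][j - 1])    # advance both
    return d[n][m]


def dp_score(a, b):
    """Score as the reciprocal of the distance; infinite when the distance is zero."""
    dist = dtw_distance(a, b)
    return float("inf") if dist == 0 else 1.0 / dist
```

With this definition a smaller distance gives a larger score, so the rule of choosing the member giving the highest score carries over unchanged.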

[0283] In the embodiments described above, an unregistered word is clustered and the unregistered word is registered in the word dictionary on the basis of the result of clustering. The present invention may also be applied to a word which is already registered in the word dictionary.

[0284] That is, because there is a possibility that different phoneme sequences are obtained for utterances of the same word, if only one phoneme sequence is registered for one word in the word dictionary, then an utterance of the word may not be correctly recognized as that word when the phoneme sequence obtained for the utterance is different from that registered in the word dictionary. In the present invention, the above problem can be solved as follows. That is, different utterances of the same word are clustered such that acoustically similar utterances belong to the same cluster, and the dictionary is updated on the basis of the result of the clustering. As a result, different phoneme sequences are registered for the same word in the word dictionary, and thus it becomes possible to perform voice recognition for various phoneme sequences of the same word.

[0285] In the word dictionary, in addition to a phoneme sequence, key information may be described in an entry corresponding to a cluster of an unregistered word, as described below.

[0286] For example, in the action decision unit 52, state recognition information output from the image recognition unit 50B or the pressure processing unit 50C is supplied to the voice recognition unit 50A, as represented by a broken line in FIG. 3, so that the maintenance unit 31 (FIG. 4) in the voice recognition unit 50A receives the state recognition information.

[0287] On the other hand, the feature vector buffer 28 and the score sheet memory 30 also store an absolute time at which an unregistered word was input. On the basis of the absolute time described in the score sheet stored in the score sheet memory 30, the maintenance unit 31 detects the state recognition information which was supplied from the action decision unit 52 when the unregistered word was input, and the maintenance unit 31 regards the detected state recognition information as key information of that unregistered word.

[0288] The maintenance unit 31 registers, in the word dictionary, the state recognition information as key information in an entry corresponding to a cluster of the unregistered word, so that the entry includes the state recognition information in addition to the phoneme sequence of the representative member of that cluster.

[0289] This makes it possible for the matching unit 23 to output, as a result of voice recognition for the unregistered word registered in the word dictionary, the state recognition information registered as the key information of the unregistered word. Furthermore, it becomes possible to make the robot take an action in accordance with the state recognition information registered as the key information.

[0290] More specifically, for example, when a word “red” is unregistered, if the CCD 16 detects an image of a red object, state recognition information indicating that a red object has been detected is supplied from the image recognition unit 50B to the voice recognition unit 50A via the action decision unit 52. Herein, if a user utters “red”, which is an unregistered word, then the voice recognition unit 50A determines a phoneme sequence of the unregistered word “red”.

[0291] In this case, the voice recognition unit 50A adds, as an entry of the unregistered word “red” to the word dictionary, the phoneme sequence of the unregistered word “red” and the key information indicating the state recognition information “red”.

[0292] If a user utters “red” thereafter, the phoneme sequence of the unregistered word “red” registered in the word dictionary has a high score for the utterance, and thus the voice recognition unit 50A outputs, as a result of voice recognition, the state recognition information “red” registered as the key information.

[0293] The result of the voice recognition is supplied from the voice recognition unit 50A to the action decision unit 52. In response, for example, the action decision unit 52 may make the robot search the environment for the red object on the basis of the output from the image recognition unit 50B and walk toward the red object.

[0294] That is, in this specific example, although the robot cannot recognize an utterance of “red” when the robot encounters it for the first time, if a user utters “red” when the robot is detecting an image of a red object, then the robot relates the utterance “red” to the red object being detected as the image, thereby making it possible for the robot to, when the user utters “red” thereafter, recognize the utterance correctly as “red” and walk toward a red object present in the environment. This gives the user an impression that the robot grows by learning what the user speaks.
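The association of key information with a dictionary entry can be sketched in Python as follows. The sketch assumes that state recognition information is available as a time-ordered log of (time, information) pairs and that the information in effect at the absolute input time of the unregistered word is taken as its key information. The class Entry, the functions state_at, register_entry, and recognition_result, and the phoneme string "r e d" are hypothetical and serve only as an illustration of the mechanism described above.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Word dictionary entry for one cluster (hypothetical layout)."""
    phonemes: str                    # phoneme sequence of the representative member
    key_info: Optional[str] = None   # state recognition information, if any


def state_at(state_log, t):
    """Latest state recognition information supplied at or before time t."""
    times = [ts for ts, _ in state_log]          # state_log must be sorted by time
    i = bisect.bisect_right(times, t) - 1
    return state_log[i][1] if i >= 0 else None


def register_entry(dictionary, cluster, phonemes, input_time, state_log):
    """Add or replace the entry of a cluster, attaching key information."""
    dictionary[cluster] = Entry(phonemes, state_at(state_log, input_time))


def recognition_result(dictionary, cluster):
    """Output the key information when present, otherwise the phoneme sequence."""
    entry = dictionary[cluster]
    return entry.key_info if entry.key_info is not None else entry.phonemes
```

For instance, if the log holds "red" from time 20.0 onward and the word is input at time 25.0, the entry for its cluster carries the key information "red", and a later recognition of that entry yields "red".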

[0295] The voice recognition apparatus 91 shown in FIG. 13 may also operate in a similar manner.

[0296] Although in the embodiments described above, the scores are stored in the score sheet, the scores may be recalculated as required.

[0297] Although in the embodiments described above, the detected cluster is divided into two clusters, the detected cluster may be divided into three or more clusters. The detected cluster may also be divided into an arbitrary number of clusters such that the distances among the clusters become greater than a predetermined value.

[0298] In the embodiments described above, not only scores but also phoneme sequences, cluster numbers, and representative member IDs are registered in the score sheet (FIG. 8). Information other than the scores need not necessarily be registered in the score sheet, and may instead be stored and managed separately from the scores.

INDUSTRIAL APPLICABILITY

[0299] According to the present invention, a cluster to which an input voice is to be added as a member is detected from existing clusters obtained by clustering voices. The input voice is added as a new member to the detected cluster, and the cluster is divided depending on the members of the cluster. On the basis of the result of the division, the dictionary is updated. Thus, it becomes possible to easily register, into the dictionary, a word which has not been registered in the dictionary without resulting in a significant increase in the size of the dictionary.

1. A voice recognition apparatus for processing an input voice and updating a dictionary used in a language processing in accordance with a result of the processing of the input voice, said voice recognition apparatus comprising: cluster detection means for detecting, from existing clusters obtained by clustering voices, a cluster to which said input voice is to be added as a new member; cluster division means for employing said input voice as the new member of the cluster detected by said cluster detection means and dividing said cluster depending on members of said cluster; and update means for updating the dictionary on the basis of a result of division performed by said cluster division means.
 2. A voice recognition apparatus according to claim 1, wherein the dictionary stores a phoneme sequence of a vocabulary to be recognized; and the update means updates the dictionary by adding, as a new entry to the dictionary, a phoneme sequence of a voice corresponding to a representative member representing members of a cluster created by the division or by replacing an entry of the dictionary with the phoneme sequence of the voice corresponding to the representative member representing members of the cluster created by the division.
 3. A voice recognition apparatus according to claim 1, wherein said cluster detection means calculates the score of the input voice with respect to each member of the cluster by determining the likelihood that the input voice is observed in the member of the cluster; selects, of the members of the cluster, a member giving a highest value to said score of the input voice and employs the selected member as a representative member representing the members of the cluster; and determines the cluster having said representative member as a cluster to which said input voice is to be added as a new member.
 4. A voice recognition apparatus according to claim 1, wherein said input voice is an unregistered word which has not been registered, in advance, in the dictionary.
 5. A voice recognition apparatus according to claim 3, wherein in a case in which, of members of the cluster, a member having a greatest sum of scores with respect to the other members of that cluster is employed as the representative member representing the members of that cluster, said cluster division means divides the cluster, to which the input voice has been added, into two clusters, that is, first and second clusters, such that two members of the original cluster become representative members of the first and second clusters, respectively.
 6. A voice recognition apparatus according to claim 5, wherein in a case in which there are a plurality of combinations of two clusters consisting of first and second clusters, said cluster division means divides a cluster including the input voice as a member thereof into two clusters such that the cluster-to-cluster distance between the first cluster and the second cluster becomes smallest.
 7. A voice recognition apparatus according to claim 6, wherein when a combination of two clusters consisting of first and second clusters is selected such that the cluster-to-cluster distance between the first cluster and the second cluster becomes smallest, if said smallest cluster-to-cluster distance is greater than a predetermined threshold, then said cluster division means divides the cluster including the input voice as a member thereof into said two clusters.
 8. A voice recognition apparatus according to claim 5, further comprising storage means for storing the scores of the members of said cluster with respect to each member of each score.
 9. A voice recognition apparatus according to claim 1, wherein the dictionary stores a phoneme sequence of a vocabulary to be recognized, and wherein said voice recognition apparatus further comprises voice recognition means for recognizing a voice on the basis of an acoustic model constructed in accordance with the phoneme sequence stored in said dictionary.
 10. A voice recognition apparatus according to claim 9, wherein said acoustic model is an HMM (Hidden Markov Model).
 11. A voice recognition apparatus according to claim 9, wherein said voice recognition means constructs an acoustic model corresponding to a phoneme sequence stored in the dictionary by concatenating HMMs in units of sub-words and recognizes a voice on the basis of said acoustic model.
 12. A voice recognition apparatus according to claim 9, wherein said voice recognition means recognizes a voice also on the basis of a predetermined grammatical rule.
 13. A voice recognition apparatus according to claim 12, wherein said voice recognition means extracts a particular period of the input voice in accordance with the predetermined grammatical rule; and said cluster detection means and said cluster division means perform their processes on said period of the input voice.
 14. A voice recognition apparatus according to claim 13, wherein said voice recognition means extracts, as said particular period, a period of an unregistered word which is not registered in the dictionary from the input voice.
 15. A voice recognition apparatus according to claim 14, wherein said voice recognition means extracts the period of the unregistered word in accordance with the predetermined grammatical rule using a garbage model.
 16. A voice recognition apparatus according to claim 1, wherein said cluster division means divides the cluster by means of an EM (Expectation-Maximization) method.
 17. A voice recognition apparatus according to claim 1, further comprising: storage means for storing voice information associated with the input voice for use by the cluster detection means to detect a cluster; and deletion means for, when it is determined that a specific condition is satisfied, deleting a particular one of pieces of voice information stored in the storage means.
 18. A voice recognition apparatus according to claim 17, wherein the voice information stored in the storage means is digital data of the input voice.
 19. A voice recognition apparatus according to claim 18, further comprising feature extraction means for extracting a feature vector indicating a specific feature of the input voice from the digital data of the input voice, wherein the voice information stored in the storage means is the feature vector of the input voice extracted by the feature extraction means.
 20. A voice recognition apparatus according to claim 17, wherein said deletion means determines that the specific condition is satisfied when the number of members belonging to the specific cluster is greater than a predetermined value.
 21. A voice recognition apparatus according to claim 17, further comprising not-referred-to time calculation means for calculating the not-referred-to time during which the cluster has not been referred to, wherein said deletion means determines that the specific condition is satisfied when the not-referred-to time of the cluster calculated by the not-referred-to time calculation means is greater than a predetermined value.
 22. A voice recognition apparatus according to claim 17, further comprising input means for inputting a trigger signal, wherein said deletion means determines that the specific condition is satisfied when the trigger signal is input via the input means.
 23. A voice recognition apparatus according to claim 17, further comprising emotion control means for controlling a parameter associated with emotion, wherein said deletion means determines that the specific condition is satisfied when the value of the parameter associated with emotion controlled by the emotion control means is greater than a predetermined value.
 24. A voice recognition apparatus according to claim 17, further comprising used-memory-area calculation means for calculating the amount of used memory area of the storage means, wherein said deletion means determines that the specific condition is satisfied when the amount of used memory area calculated by the used-memory-area calculation means is greater than a predetermined value.
 25. A voice recognition apparatus according to claim 17, further comprising clustering means for reclustering a voice corresponding to voice information stored in the storage means.
 26. A voice recognition apparatus according to claim 25, wherein said update means updates the dictionary also on the basis of a result of reclustering performed by the clustering means.
 27. A voice recognition apparatus according to claim 25, further comprising representative member selection means for, when the voice information is deleted by the deletion means, selecting a representative member representing the members of the cluster to which the voice corresponding to the deleted voice information belongs before being deleted, wherein if the new representative member selected by the representative member selection means is different from a previous representative member, said clustering means reclusters all voice information stored in the storage means.
 28. A voice recognition apparatus according to claim 27, further comprising deletion cancel means for, if the structure of the cluster reclustered by the clustering means is different from the immediately previous structure the cluster had before being reclustered by the clustering means, canceling the deletion process performed, by the deletion means, on the voice information such that the original state is obtained.
 29. A voice recognition apparatus according to claim 27, wherein said clustering means performs reclustering by means of a k-means method.
 30. A voice recognition method for processing an input voice and updating a dictionary used in a language processing in accordance with a result of the processing of the input voice, said voice recognition method comprising the steps of: detecting, from existing clusters obtained by clustering voices, a cluster to which said input voice is to be added as a new member; employing said input voice as the new member of the cluster detected in said cluster detection step and dividing said cluster depending on members of said cluster; and updating the dictionary on the basis of a result of division performed in said cluster division step.
 31. A program for causing a computer to perform voice processing for processing an input voice and updating a dictionary used in a language processing in accordance with a result of the processing of the input voice, said program comprising the steps of: detecting, from existing clusters obtained by clustering voices, a cluster to which said input voice is to be added as a new member; employing said input voice as the new member of the cluster detected in said cluster detection step and dividing said cluster depending on members of said cluster; and updating the dictionary on the basis of a result of division performed in said cluster division step.
 32. A storage medium including a program stored therein for causing a computer to perform voice processing for processing an input voice and updating a dictionary used in a language processing in accordance with a result of the processing of the input voice, said program comprising the steps of: detecting, from existing clusters obtained by clustering voices, a cluster to which said input voice is to be added as a new member; employing said input voice as the new member of the cluster detected in said cluster detection step and dividing said cluster depending on members of said cluster; and updating the dictionary on the basis of a result of division performed in said cluster division step.